Abstract. Wavelet trees are widely used in the representation of sequences, permutations, text collections,
binary relations, discrete points, and other succinct data structures. We show, however, that this
still falls short of exploiting all of the virtues of this versatile data structure. In particular we show how 
to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range 
next value queries, and range intersection queries. We explore several applications of these queries in 
Information Retrieval, in particular document retrieval in hierarchical and temporal documents, and in 
the representation of inverted lists. 

1 Introduction 

The wavelet tree [34] is a versatile data structure that stores a sequence $S[1,n]$ of elements from
a symbol universe $[1,\sigma]$ within asymptotically the same space required by a plain representation
of the sequence, $n \log \sigma \, (1 + o(1))$ bits (footnote 4). Within that space, the wavelet tree is able to return any
sequence element S[i], and also to answer two queries on S that are fundamental in compressed 
data structures for text retrieval: 

$\mathrm{rank}_c(S,i)$ = number of occurrences of symbol $c$ in $S[1,i]$,
$\mathrm{select}_c(S,j)$ = position of the $j$th occurrence of symbol $c$ in $S$.

The time for these three queries is $O(\log \sigma)$ (footnote 5). Originally designed for compressing suffix arrays
[34], the usefulness of the wavelet tree for many other scenarios was quickly realized. It was soon 
adopted as a fundamental component of a large class of compressed text indexes, the FM-index 
family, giving birth to most of its modern variants [27, 43, 28, 45]. 

The connection between the wavelet tree and an old geometric structure by Chazelle [19] made 
it evident that wavelet trees could be used for range counting and reporting points in the plane. 
More formally, given a set of $t$ points $P = \{(x_i, y_i),\ 1 \le i \le t\}$ on a discrete grid $[1,n] \times [1,\sigma]$,
wavelet trees answer the following basic queries: 

$\mathrm{range\_count}(P, x^s, x^e, y^s, y^e)$ = number of pairs $(x_i, y_i)$ such that $x^s \le x_i \le x^e$, $y^s \le y_i \le y^e$,
$\mathrm{range\_report}(P, x^s, x^e, y^s, y^e)$ = list of those pairs $(x_i, y_i)$ in some order,

* Early parts of this work appeared in SPIRE 2009 [32] and SPIRE 2010 [53]. 
** Partially supported by Fondecyt Grant 1-080019, Chile. 
* * * Partially supported by the Australian Research Council. 

4 Our logarithms are in base 2 unless otherwise stated. Moreover, within a time complexity, $\log x$ should be
understood as $\max(1, \log x)$.

5 This can be reduced to $O\left(1 + \frac{\log \sigma}{\log \log n}\right)$ [28] using multiary wavelet trees, but these do not merge well with the new
algorithms we develop in this article.



both in $O(\log \sigma)$ time [44] (footnote 6). These new capabilities were subsequently used to design powerful
succinct representations of two-dimensional point grids [44, 14, 16], permutations [12], and binary 
relations [7], with applications to other compressed text indexes [50,20,21], document retrieval 
problems [66] and many others. 

In this paper we show, by uncovering new capabilities, that the full potential of wavelet trees 
is far from realized. We show that the wavelet tree allows us to solve the following fundamental 
queries: 

$\mathrm{range\_quantile}(S, i, j, k)$ = $k$th smallest value in $S[i,j]$,
$\mathrm{range\_next\_value}(S, i, j, x)$ = smallest $S[r] \ge x$ such that $i \le r \le j$,
$\mathrm{range\_intersect}(S, i_1, j_1, \ldots, i_k, j_k)$ = distinct common values in $S[i_1,j_1], S[i_2,j_2], \ldots, S[i_k,j_k]$.

The first two are solved in time $O(\log \sigma)$, whereas the cost of the latter is $O(\log \sigma)$ per delivered
value plus the size of the intersection of the tries that describe the different values in $S[i_1, j_1]$ and
$S[i_2, j_2]$. A crude upper bound for the latter is $O(\min(\sigma, j_1 - i_1 + 1, j_2 - i_2 + 1))$; however, we give
an adaptive analysis of our method, showing it requires $O(\alpha \log \frac{\sigma}{\alpha})$ time, where $\alpha$ is the so-called
alternation complexity of the problem [8].

All these algorithmic problems are well known. Har-Peled and Muthukrishnan [35] describe
applications of range median queries (a particular case of range_quantile) to the analysis of Web
advertising logs. Stolinski et al. [64] use them for noise reduction in grey scale images. Similarly,
Crochemore et al. [23] use range_next_value queries for interval-restricted pattern matching, and
Keller et al. [40] and Crochemore et al. [22] use them for many other sophisticated pattern matching
problems. Hon et al. [37] use range_intersect queries for generalized document retrieval, and in a
simplified form the problem also appears when processing conjunctive queries in inverted indexes.

We further illustrate the importance of these fundamental algorithmic problems by uncovering 
new applications in several Information Retrieval (IR) activities. We first consider document re- 
trieval problems on general sequences. This generalizes the classical IR problems usually dealt with 
on Natural Language (NL), and defines them in a more general setting where one has a collection 
C of strings (i.e., the documents), and queries are strings as well. Then one is interested in any sub- 
string of the collection that matches the query, and the following IR problems are defined (among 
several others): 

$\mathrm{doc\_listing}(q)$ = distinct documents where query $q$ appears,
$\mathrm{doc\_frequency}(q, d)$ = number of occurrences of query $q$ in document $d$,
$\mathrm{doc\_intersect}(q_1, \ldots, q_k)$ = distinct documents where all queries $q_1, \ldots, q_k$ appear.

These generalized IR problems have applications in text databases where the concept of words 
does not exist or is difficult to define, such as in Oriental languages, DNA and protein sequences, 
program code, music and other multimedia sequences, and numeric streams in general. The interest 
in carrying out IR tasks on, say, Chinese or Korean is obvious despite the difficulty of automatically 
delimiting the words. In those cases one resorts to a model where the text is seen as a sequence 
of symbols and must be able to retrieve any substring. Agglutinating languages such as Finnish 
or German present similar problems to a certain degree. While indexes for plain string matching 
are well known, supporting more sophisticated IR tasks such as ranked document retrieval is a 

6 Again, this can be reduced to $O\left(1 + \frac{\log \sigma}{\log \log n}\right)$ using multiary wavelet trees [14].
very recent research area. It is not hard to imagine that similar capabilities would be of interest
in other types of sequences: for example listing the functions where two given variables are used 
simultaneously in a large software development system, or ranking a set of gene sequences by the 
number of times a given substring marker occurs. 

By constructing a suffix array A [47] on the text collection, one can obtain in time $O(|q| \log |C|)$
the range of A where all the occurrence positions of q in C are listed. The classical solution to
document retrieval problems [49] starts by defining a document array D giving the document to
which each suffix of A belongs. Then problems like document listing boil down to listing the distinct
values in a range of D, and intersection of documents becomes the intersection of values in a range of
D. Both are solved with our new fundamental algorithms (the former with range quantile queries).
Other queries such as computing frequencies reduce to a pair of $\mathrm{rank}_d$ queries on D.

Second, we generalize document retrieval problems to other scenarios. The first scenario is 
temporal documents, where the document numbers are consistent with increasing version numbers 
of the document set. Then one is interested in restricting the above queries to a given interval of 
time (i.e., of document numbers). A similar case is that of hierarchical documents, which contain 
each other as in the case of an XML collection or a file system. Here, restricting the query to a 
range of document numbers is equivalent to restricting it to a subtree of the hierarchy. However, 
one can consider more complex queries in the hierarchical case, such as marking a set of retrievable 
nodes at query time and carrying out the operations with respect to those nodes. We show how to 
generalize our algorithms to handle this case as well. 

Finally, we show that variants of our new fundamental algorithms are useful to enhance the 
functionality of inverted lists, the favorite data structures for both ranked and full-text retrieval in 
NL. Each of these retrieval paradigms requires a different variant of the inverted list, and one has 
to maintain both in order to support all the activities usually required in an IR system. We show 
that a wavelet tree representation of the inverted lists supports not only the basic functionality of 
both representations within essentially the space of one, but also several enhanced functionalities 
such as on-the-fly stemming and restriction of documents, and most list intersection algorithms. 

The article is structured as follows. In Section 2 we review the wavelet tree data structure and 
its basic algorithmics. Section 3 reviews some basic IR concepts. Then Section 4 describes the new 
solutions to fundamental algorithmic problems, whereas Sections 5 and 6 explore applications to 
various IR problems. Finally we conclude in Section 7. 

2 Wavelet Trees 

A wavelet tree T [34] for a sequence $S[1,n]$ over an ordered alphabet $[1,\sigma]$ is an ordered, strictly
binary tree whose leaves are labeled with the distinct symbols in S in order from left to right, and
whose internal nodes $T_v$ store binary strings $B_v$. The binary string at the root contains n bits and
each is set to 0 or 1 depending on whether the corresponding character of S is the label of a leaf in
T's left or right subtree. For each internal node v of T, the subtree $T_v$ rooted at v is itself a wavelet
tree for the subsequence $S_v$ of S consisting of the occurrences of its leaf labels in $T_v$. For example,
if S = abracadabra and the leaves in T's left subtree are labeled a, b and c, then the root stores
00100010010, the left subtree is a wavelet tree for abacaaba and the right subtree is a wavelet tree
for rdr.
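The root-level split in this example can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `split_node` and the use of character strings for bitmaps are assumptions made here, not part of the data structure as it would be stored in practice.

```python
def split_node(seq, left_symbols):
    """Return (bitmap, left subsequence, right subsequence) for one wavelet tree node."""
    left_symbols = set(left_symbols)
    # A 0 marks a symbol that belongs to the left subtree, a 1 one that belongs to the right.
    bitmap = "".join("0" if c in left_symbols else "1" for c in seq)
    left = "".join(c for c in seq if c in left_symbols)
    right = "".join(c for c in seq if c not in left_symbols)
    return bitmap, left, right


bitmap, left, right = split_node("abracadabra", "abc")
print(bitmap, left, right)  # 00100010010 abacaaba rdr
```

Applying the same split recursively to the two subsequences yields the bitmaps of the remaining internal nodes.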

In this article we consider balanced wavelet trees, where the number of leaves to the left and to 
the right of each node differ at most by 1. The important properties of such a wavelet tree for our 
purposes are summarized in the following lemma. 
Lemma 1. The wavelet tree T for a sequence $S[1,n]$ on alphabet $[1,\sigma]$ with $u$ distinct symbols
requires at most $n \log \sigma + O(n)$ bits of space, and can be constructed in $O(n \log u)$ time.

Proof. By the description above the wavelet tree has height $\lceil \log u \rceil$ and can be easily built in time
$O(n \log u)$ (we need to determine the $u \le \min(n, \sigma)$ distinct values first, but this is straightforward
within the same complexity).

As for the space, note that the wavelet tree stores only the bitmaps $B_v$ for all the nodes. The
total length of the binary strings is at most $n$ at each level of the wavelet tree, which adds up to
$n \lceil \log u \rceil$. Apart from the bitmaps, there is the binary tree of $O(u)$ nodes. Instead of storing the
nodes, one can concatenate all the bitmaps of the same depth and simulate the nodes [44], so this
requires just one pointer per level, $O(\log u \log n) = o(n)$ bits.

The distinct values must be stored as well. Indeed, if $\sigma \le n$, we can just assume all the $\sigma$
values exist and the wavelet tree will have $\lceil \log \sigma \rceil$ levels and the lemma holds. Otherwise, we can
mark the unique values in a bitmap $U[1,\sigma]$, which can be stored in compressed form [54] so that
it requires $u \log \frac{\sigma}{u} + O(u)$ bits and the $i$th distinct number is retrieved as $\mathrm{select}_1(U,i)$ in constant
time (footnote 7). Adding up all the spaces we get $n \log u + O(n) + u \log \frac{\sigma}{u} + O(u) \le n \log \sigma + O(n)$ bits, and
the construction time is $O(u)$.

Finally, we can represent the bitmaps with data structures that support constant-time (binary)
rank and select operations [55]. The overall extra space stays within $O(n)$ bits and the construction
time within $O(n \log u)$. Binary rank and select operations are essential to operate on the wavelet
trees, as seen shortly. □

The most basic operation of T is to replace S, by retrieving any $S[i]$ value in $O(\log u)$ time.
The algorithm is as follows. We first examine the $i$th bit of the root bitmap $B_{root}$. If $B_{root}[i] = 0$,
then symbol $S[i]$ corresponds to a leaf descending by the left child of the root, and by the right
otherwise. In the first case we continue recursively on the left child, $T_l$. However, position $i$ must now
be mapped to the subsequence handled at $T_l$. Precisely, if the 0 at $B_{root}[i]$ is the $j$th 0 in $B_{root}$, then
$S[i]$ is mapped to $S_v[j]$, where $v$ is the left child. In other words, when we go left, we must recompute
$i \leftarrow \mathrm{rank}_0(B_{root}, i)$. Similarly, when we go right we set $i \leftarrow \mathrm{rank}_1(B_{root}, i)$.

When the tree nodes are not explicit, we find out the intervals corresponding to $B_v$ in the
levelwise bitmaps as follows. $B_{root}$ is a single bitmap. If node $v$ has depth $d$, and $B_v$ corresponds
to interval $B_d[l, r]$, then its left child corresponds to $B_{d+1}[l, k]$ and its right child to $B_{d+1}[k+1, r]$,
where $k = l - 1 + \mathrm{rank}_0(B_d, r) - \mathrm{rank}_0(B_d, l - 1)$ [44]; that is, the left child takes as many
positions as there are 0s in $B_d[l, r]$.

The wavelet tree can also answer $\mathrm{rank}_c(S, i)$ queries on S with a mechanism similar to that for
retrieving $S[i]$. This time one decides whether to go left or right depending on which subtree of the
current node the leaf labeled $c$ appears in, and not on the bit values of $B_v$. The final $i$ value when
one reaches the leaf is the answer. Again, the process requires $O(\log u)$ time.

Finally, $\mathrm{select}_c(S, j)$ is also supported in $O(\log u)$ time using the wavelet tree. This time we
start from position $j$ at the leaf labeled $c$ (footnote 8); this indeed corresponds to the $j$th occurrence of symbol
$c$ in S. If the leaf is a left child of its parent $v$, then the position of that $c$ in $S_v$ is $\mathrm{select}_0(B_v, j)$,
and $\mathrm{select}_1(B_v, j)$ if the leaf is a right child of $v$. We continue recursively from this new $j$ value
until reaching the root, where $j$ is the answer.
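The three operations just described can be sketched in Python as follows. This is a minimal pointer-based illustration under several assumptions that are not part of the text: the class name `Node`, bitmaps stored as Python lists with prefix counters instead of succinct constant-time rank/select structures, and a linear-scan binary select. Positions are 1-based, as in the text.

```python
class Node:
    """Wavelet tree node over the sorted symbol list `symbols` (illustrative only)."""

    def __init__(self, seq, symbols):
        self.symbols = symbols
        self.leaf = len(symbols) == 1
        if self.leaf:
            return
        mid = (len(symbols) + 1) // 2            # balanced split of the leaves
        left_set = set(symbols[:mid])
        self.bits = [0 if c in left_set else 1 for c in seq]
        # prefix[b][i] = number of bits equal to b among the first i bits
        self.prefix = [[0], [0]]
        for bit in self.bits:
            self.prefix[0].append(self.prefix[0][-1] + (bit == 0))
            self.prefix[1].append(self.prefix[1][-1] + (bit == 1))
        self.left = Node([c for c in seq if c in left_set], symbols[:mid])
        self.right = Node([c for c in seq if c not in left_set], symbols[mid:])

    def rank_bit(self, b, i):
        """Number of bits equal to b in positions 1..i."""
        return self.prefix[b][i]

    def select_bit(self, b, j):
        """Position of the j-th bit equal to b (j >= 1); naive linear scan."""
        return next(p for p in range(1, len(self.bits) + 1) if self.prefix[b][p] == j)


def access(v, i):
    """Return S[i]."""
    if v.leaf:
        return v.symbols[0]
    b = v.bits[i - 1]
    return access(v.left if b == 0 else v.right, v.rank_bit(b, i))


def rank(v, c, i):
    """Occurrences of c in S[1..i]; c is assumed to belong to the alphabet."""
    if v.leaf:
        return i
    if c in v.left.symbols:
        return rank(v.left, c, v.rank_bit(0, i))
    return rank(v.right, c, v.rank_bit(1, i))


def select(v, c, j):
    """Position of the j-th occurrence of c in S, assuming it exists."""
    if v.leaf:
        return j
    if c in v.left.symbols:
        return v.select_bit(0, select(v.left, c, j))
    return v.select_bit(1, select(v.right, c, j))


S = "abracadabra"
root = Node(list(S), sorted(set(S)))
print(access(root, 3), rank(root, "a", 8), select(root, "b", 2))  # r 4 9
```

The recursion depth equals the height of the tree, so each query touches one node per level; the constant-time bounds of the text additionally require the succinct bitmap structures that this sketch replaces with plain lists.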

7 For this, one has to use a constant-time data structure for select [48] in their internal bitmap $H[1, 2u]$ [54].

8 If the tree nodes are not explicitly stored then we first descend to the node labeled c in order to delimit the interval 
corresponding to the leaf and to all of its ancestors in the levelwise bitmaps. 
Algorithm 1 Basic wavelet tree algorithms: On the wavelet tree of sequence S, access(v_root, i)
returns S[i]; rank(v_root, c, i) returns rank_c(S, i); and select(v_root, c, i) returns select_c(S, i).

access(v, i)
    if v is a leaf then
        return label(v)
    else if B_v[i] = 0 then
        return access(v_l, rank_0(B_v, i))
    else
        return access(v_r, rank_1(B_v, i))
    end if

rank(v, c, i)
    if v is a leaf then
        return i
    else if c ∈ labels(v_l) then
        return rank(v_l, c, rank_0(B_v, i))
    else
        return rank(v_r, c, rank_1(B_v, i))
    end if

select(v, c, i)
    if v is a leaf then
        return i
    else if c ∈ labels(v_l) then
        return select_0(B_v, select(v_l, c, i))
    else
        return select_1(B_v, select(v_r, c, i))
    end if



Algorithm 2 Range algorithms: count(v_root, x^s, x^e, [y^s, y^e]) returns range_count(P, x^s, x^e, y^s, y^e)
on the wavelet tree of sequence P; and report(v_root, x^s, x^e, [y^s, y^e]) outputs all pairs (y, f), where
y^s ≤ y ≤ y^e and y appears f > 0 times in P[x^s, x^e].

count(v, x^s, x^e, rng)
    if x^s > x^e ∨ labels(v) ∩ rng = ∅ then
        return 0
    else if labels(v) ⊆ rng then
        return x^e - x^s + 1
    else
        x^s_l ← rank_0(B_v, x^s - 1) + 1
        x^e_l ← rank_0(B_v, x^e)
        x^s_r ← x^s - x^s_l + 1,  x^e_r ← x^e - x^e_l
        return count(v_l, x^s_l, x^e_l, rng) + count(v_r, x^s_r, x^e_r, rng)
    end if

report(v, x^s, x^e, rng)
    if x^s > x^e ∨ labels(v) ∩ rng = ∅ then
        return
    else if v is a leaf then
        output (label(v), x^e - x^s + 1)
    else
        x^s_l ← rank_0(B_v, x^s - 1) + 1
        x^e_l ← rank_0(B_v, x^e)
        x^s_r ← x^s - x^s_l + 1,  x^e_r ← x^e - x^e_l
        report(v_l, x^s_l, x^e_l, rng)
        report(v_r, x^s_r, x^e_r, rng)
    end if

Algorithm 1 gives pseudocode for the basic access, rank and select algorithms on wavelet trees.
For all the pseudocodes in this article we use the following notation: $v$ is a wavelet tree node and
$v_{root}$ is the root node. If $v$ is a leaf then its symbol is label($v$) $\in [1,\sigma]$. Otherwise $v_l$ and $v_r$ are its
left and right children, respectively, and $B_v$ is its bitmap. For all nodes, labels($v$) is the range of
leaf labels that descend from $v$ (a singleton in case of leaves).

As we make use of range_count and a form of range_report queries in this article, we give
pseudocode for them as well, in Algorithm 2. Indeed, range_count is a kind of multi-symbol rank
and range_report is a kind of multi-symbol access.

In Section 4 we develop new algorithms based on wavelet trees to solve fundamental algorithmic
problems. We prove now a few simple lemmas that are useful for analyzing range_count and
range_report, as well as many other algorithms we introduce throughout the article. Most results
are folklore but we reprove them here for completeness.
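The counting and reporting procedures can be sketched in Python as below. The tuple-based node layout, the function name `build`, and the assumption that every value of the alphabet [lo, hi] gets a leaf (even if it does not occur) are choices made for this illustration only; rank on the bitmaps is replaced by a stored prefix count of 0s. Positions and values are 1-based.

```python
def build(seq, lo, hi):
    """Node for values in [lo, hi]: (lo, hi, prefix counts of 0 bits, left child, right child)."""
    if lo == hi:
        return (lo, hi, None, None, None)
    mid = (lo + hi) // 2
    zeros = [0]                                   # zeros[i] = rank_0(B_v, i)
    for y in seq:
        zeros.append(zeros[-1] + (y <= mid))
    left = build([y for y in seq if y <= mid], lo, mid)
    right = build([y for y in seq if y > mid], mid + 1, hi)
    return (lo, hi, zeros, left, right)


def count(v, xs, xe, ys, ye):
    """Number of positions x in [xs, xe] whose value lies in [ys, ye]."""
    lo, hi, zeros, left, right = v
    if xs > xe or hi < ys or lo > ye:             # empty range or disjoint labels
        return 0
    if ys <= lo and hi <= ye:                     # all labels below v are wanted
        return xe - xs + 1
    xs_l, xe_l = zeros[xs - 1] + 1, zeros[xe]
    xs_r, xe_r = xs - xs_l + 1, xe - xe_l
    return count(left, xs_l, xe_l, ys, ye) + count(right, xs_r, xe_r, ys, ye)


def report(v, xs, xe, ys, ye, out):
    """Append (value, frequency) for each distinct value of [ys, ye] occurring in positions [xs, xe]."""
    lo, hi, zeros, left, right = v
    if xs > xe or hi < ys or lo > ye:
        return
    if lo == hi:
        out.append((lo, xe - xs + 1))
        return
    xs_l, xe_l = zeros[xs - 1] + 1, zeros[xe]
    xs_r, xe_r = xs - xs_l + 1, xe - xe_l
    report(left, xs_l, xe_l, ys, ye, out)
    report(right, xs_r, xe_r, ys, ye, out)


P = [3, 1, 4, 1, 5, 2, 6, 5]
root = build(P, 1, 6)
found = []
report(root, 2, 7, 2, 5, found)
print(count(root, 2, 7, 2, 5), found)  # 3 [(2, 1), (4, 1), (5, 1)]
```

Because the left child is always visited first, the reported values come out in increasing order.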

Lemma 2. Any contiguous range of $\ell$ leaves in a wavelet tree is the set of descendants of $O(\log \ell)$
nodes.

Proof. Start with the $\ell$ leaves. For each consecutive pair that shares the same parent, replace the
pair by their parent. At most two leaves are not replaced, and at most $\ell/2$ parents are created.
Repeat the operation at the parent level, and so on. After working on $\lceil \log \ell \rceil$ levels, we have at
most two nodes per wavelet tree level, for a total of $O(\log \ell)$ nodes covering the original interval.
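The covering nodes of Lemma 2 can be enumerated with a short recursive sketch. The function name `cover` and the identification of each node with its interval of leaf labels (split at the midpoint) are assumptions of this illustration, which uses a perfectly balanced tree over 16 leaves.

```python
def cover(lo, hi, a, b):
    """Maximal subtrees (as label intervals) of a balanced tree over leaves [lo, hi] covering [a, b]."""
    if b < lo or hi < a:                 # node disjoint from the range
        return []
    if a <= lo and hi <= b:              # node entirely inside the range
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return cover(lo, mid, a, b) + cover(mid + 1, hi, a, b)


nodes = cover(1, 16, 3, 13)
print(nodes)  # [(3, 4), (5, 8), (9, 12), (13, 13)]
```

Here the 11 leaves 3..13 are covered by 4 nodes, with at most two nodes at any level of the tree.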

Lemma 3. Any set of $r$ nodes in a wavelet tree of $u$ leaves has at most $O(r \log \frac{u}{r})$ ancestors.

Proof. Consider the paths from the root to each of the $r$ nodes. They cannot be all disjoint. They
share the least if they diverge from depth $\lceil \log r \rceil$. In this case, all the $O(r)$ tree nodes of depth up
to $\lceil \log r \rceil$ belong to some path, and from that depth each of the $r$ paths is disjoint, adding at most
$\lceil \log u \rceil - \lceil \log r \rceil$ distinct ancestors. The total is $O(r + r \log \frac{u}{r})$.

Lemma 4. Any set of $r$ nodes covering a contiguous range of leaves in a wavelet tree of $u$ leaves
has at most $O(r + \log u)$ ancestors.

Proof. We first count all the ancestors of the $\ell$ consecutive leaves covered and then subtract the
sizes of the subtrees rooted at the $r$ nodes $v_1, v_2, \ldots, v_r$. Start with $\ell$ leaves. Mark all the parents of
the leaves. At most $\lceil \ell/2 \rceil \le 1 + \ell/2$ distinct parents are marked, as most pairs of consecutive leaves
will share the same parent. Mark the parents of the parents. At most $\lceil (1 + \ell/2)/2 \rceil \le 3/2 + \ell/4$
parents of parents are marked. At height $h$, the number of marked nodes is always less than $2 + \ell/2^h$.
Adding over all heights, we have that the total number of ancestors is at most $2\ell + 2\log u$. Now let
$\ell_i$ be the number of leaves covered by node $v_i$, so that $\sum_{1 \le i \le r} \ell_i = \ell$. The subtree rooted at each
$v_i$ has $2\ell_i - 1$ nodes. By subtracting those subtree sizes and adding back the $r$ root nodes we get
$2\ell + 2\log u - (2\ell - r) + r = O(r + \log u)$.

From the lemmas we conclude that count in Algorithm 2 takes time $O(\log u)$: it finds the
$O(\log(y^e - y^s + 1))$ nodes that cover the range $[y^s, y^e]$ (Lemma 2), by working in time proportional
to the number of ancestors of those nodes, $O(\log(y^e - y^s + 1) + \log u) = O(\log u)$ (Lemma 4).
Interestingly, report in Algorithm 2 can be analyzed in two ways. On one hand, it takes
time $O(y^e - y^s + \log u)$ as it arrives at most at the $y^e - y^s + 1$ consecutive leaves and thus it works
on all of their ancestors (Lemma 4). On the other hand, if it outputs $r$ results (which are not
necessarily consecutive), it also works proportionally to the number of their ancestors, $O(r \log \frac{u}{r})$
(Lemma 3). The latter is an output-sensitive analysis. The following lemma shows that the cost is
indeed $O\left(\log u + r \log \frac{y^e - y^s + 1}{r}\right)$.

Lemma 5. The number of ancestors of $r$ wavelet tree leaves chosen from $\ell$ contiguous leaves, on
a wavelet tree of $u$ leaves, is $O\left(\log u + r \log \frac{\ell}{r}\right)$.

Proof. By Lemma 2 those leaves are covered by $c = O(\log \ell)$ nodes. Say that $r_i$ of the $r$ searches
fall within the $i$th of those subtrees, which has $\ell_i$ leaves; then by Lemma 3 the number of nodes accessed
within that subtree is at most $O\left(r_i \log \frac{\ell_i}{r_i}\right)$. Since $\sum_i \ell_i = \ell$ and $\sum_i r_i = r$, these terms add up
by convexity to at most $O\left(r \log \frac{\ell}{r}\right)$. The ancestors reached above those subtrees are
$O(c + \log u) = O(\log u)$ by Lemma 4, for a total of $O\left(\log u + r \log \frac{\ell}{r}\right)$.

3 Information Retrieval Concepts 
3.1 Suffix and Document Arrays 

Let C be a collection of documents (which are actually strings over an alphabet $[1,\sigma]$) $D_1, D_2, \ldots, D_m$.
Assume strings are terminated by a special character "$", which does not occur elsewhere in the
collection. Now we identify C with the concatenation of all the documents, $C[1,n] = D_1 D_2 \ldots D_m$.
Each position $i$ defines a suffix $C[i,n]$. A suffix array [47] of C is an array $A[1,n]$ where the integers
$[1,n]$ are ordered in such a way that the suffix starting at $A[i]$ is lexicographically smaller than that
starting at $A[i+1]$, for all $1 \le i < n$.

Put another way, the suffix array lists all the suffixes of the collection in lexicographic order.
Since any substring of C is the prefix of a suffix, finding the occurrences of a query string q in C
is equivalent to finding the suffixes that start with q. These form a lexicographic range of suffixes,
and thus can be found via two binary searches in A (accessing C for the string comparisons). As
each step in the binary search may require comparing up to $|q|$ symbols, the total search time
is $O(|q| \log n)$. Once the interval $A[sp, ep]$ is determined, all the occurrences of q start at $A[i]$ for
$sp \le i \le ep$. Compressed full-text self-indexes permit representing both C and A within the space
required to represent C in compressed form, and for example determine the range $[sp, ep]$ within
time $O(|q| \log \sigma)$ and list each $A[i]$ in time $O(\log^{1+\epsilon} n)$ for any constant $\epsilon > 0$ [28, 51].
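The two binary searches can be illustrated with a small Python sketch on a plain (uncompressed) suffix array. The function names, the naive construction by sorting suffixes, and the 0-based indexing are assumptions of this example; a compressed self-index would provide the same range without storing the text or the array explicitly.

```python
def suffix_array(text):
    """Naive construction: sort the suffix start positions (0-based) lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_range(text, sa, q):
    """Return (sp, ep), 0-based and inclusive, of the suffixes starting with q; sp > ep if none."""
    m = len(q)
    lo, hi = 0, len(sa)
    while lo < hi:                                # first suffix whose m-prefix is >= q
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < q:
            lo = mid + 1
        else:
            hi = mid
    sp, hi = lo, len(sa)
    while lo < hi:                                # first suffix whose m-prefix is > q
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= q:
            lo = mid + 1
        else:
            hi = mid
    return sp, lo - 1


text = "abracadabra$"
sa = suffix_array(text)
sp, ep = sa_range(text, sa, "abra")
print(sp, ep, sorted(sa[sp:ep + 1]))  # 2 3 [0, 7]
```

Each comparison inspects at most |q| symbols and each search takes O(log n) steps, which matches the O(|q| log n) bound stated above.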



For listing the distinct documents where q appears, one option is to find out the document to
which each $A[i]$ belongs and remove duplicates. This, however, requires $\Omega(ep - sp + 1)$ time; that
is, it is proportional to the total number of occurrences of q, $occ = ep - sp + 1$. This may be much
larger than the number of distinct documents where q appears, $docc$.

Muthukrishnan [49] solved this problem optimally by defining a so-called document array $D[1,n]$,
so that $D[i]$ is the document suffix $A[i]$ belongs to. Other required data structures in his solution are
an array $C[1,n]$, so that $C[i] = \max\{j < i,\ D[j] = D[i]\}$, and a data structure to compute range minimum
queries on C, $\mathrm{RMQ}_C(i,j) = \mathrm{argmin}_{i \le k \le j} C[k]$. Muthukrishnan was able to list all the distinct
documents where q appears in time $O(docc)$ once the interval $A[sp, ep]$ was found. However, the data
structures occupied $O(n \log n)$ bits of space, which is too much if we consider the compressed self-
indexes that solve the basic string search problem. Another problem is that the resulting documents
are not retrieved in ascending order, which is inconvenient for several purposes.
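A compact Python sketch of this range-minimum recursion follows. It is an illustration under stated assumptions: indices are 0-based, a missing previous occurrence is encoded as -1, and the range minimum is found by a linear scan rather than by a constant-time RMQ structure, so the sketch does not achieve the O(docc) bound.

```python
def prev_occurrence(D):
    """C[i] = largest j < i with D[j] == D[i], or -1 if there is none (0-based)."""
    last, C = {}, []
    for i, d in enumerate(D):
        C.append(last.get(d, -1))
        last[d] = i
    return C


def list_documents(D, C, sp, ep):
    """Distinct values in D[sp..ep] (inclusive), found by recursing on range minima of C."""
    out = []

    def visit(lo, hi):
        if lo > hi:
            return
        k = min(range(lo, hi + 1), key=C.__getitem__)   # naive RMQ_C(lo, hi)
        if C[k] >= sp:        # every entry in [lo, hi] repeats a document seen earlier in [sp, ep]
            return
        out.append(D[k])      # first occurrence of D[k] inside [sp, ep]
        visit(lo, k - 1)
        visit(k + 1, hi)

    visit(sp, ep)
    return out


D = [1, 2, 1, 3, 2, 2, 1, 3]
C = prev_occurrence(D)
print(sorted(list_documents(D, C, 2, 6)))  # [1, 2, 3]
```

Note that the documents are produced in the order in which the recursion meets them, not in ascending order, which is the inconvenience mentioned above.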

Välimäki and Mäkinen [66] were the first to illustrate the power of wavelet trees for this problem.
By representing D with a wavelet tree, they simulated $C[i] = \mathrm{select}_{D[i]}(D, \mathrm{rank}_{D[i]}(D, i - 1))$
without storing it. By using a 2n-bit data structure for RMQ [29], the total space was reduced to
$n \log m \, (1 + o(1)) + O(n)$ bits, and still Muthukrishnan's algorithm was simulated within reasonable
time, $O(docc \log m)$.

Ranked document retrieval is usually built around two measures: term frequency, $tf_{d,q}$ =
doc_frequency(q, d), is the number of times the query q appears in document d, and the document
frequency $df_q$ is the number of different documents where q appears. For example, a typical
weighting formula is $w_{d,q} = tf_{d,q} \times idf_q$, where $idf_q = \log \frac{m}{df_q}$ is called the inverse document
frequency. Term frequencies are easily computed with wavelet trees as doc_frequency(q, d) =
$\mathrm{rank}_d(D, ep) - \mathrm{rank}_d(D, sp - 1)$. Document frequencies can be computed with just $2n + o(n)$ more
bits for the case of the D array [61], and on top of a wavelet tree for the C array for more general
scenarios [31].
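As a small worked illustration of these measures, the following Python sketch computes a term frequency as a difference of two ranks on D and combines it with the inverse document frequency. The function names and the naive linear-time rank are assumptions of the example (a wavelet tree on D would answer each rank in logarithmic time); positions are 1-based and logarithms are in base 2.

```python
import math


def doc_frequency(D, d, sp, ep):
    """tf_{d,q}: occurrences of document d in D[sp..ep], as rank_d(D, ep) - rank_d(D, sp - 1)."""
    def rank(i):
        return sum(1 for x in D[:i] if x == d)    # naive rank_d(D, i)
    return rank(ep) - rank(sp - 1)


def weight(D, m, d, sp, ep):
    """w_{d,q} = tf_{d,q} * idf_q, with idf_q = log2(m / df_q) and df_q counted naively."""
    tf = doc_frequency(D, d, sp, ep)
    df = len(set(D[sp - 1:ep]))                   # distinct documents in the range
    return tf * math.log2(m / df)


D = [1, 2, 1, 3, 2, 2, 1, 3]                      # document array, m = 4 documents in total
print(doc_frequency(D, 2, 3, 7))                  # 2
print(weight(D, 4, 2, 3, 7))                      # 2 * log2(4/3)
```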

In Section 5 we show how our new algorithms solve the document listing problem within the
same time complexity $O(docc \log m)$, without using any RMQ data structure, while reporting the
documents in increasing order. This is the basis for a novel algorithm to list the documents where 
two (or more) queries appear simultaneously. We extend these solutions to temporal and hierarchical 
document collections. 

3.2 Inverted Indexes 

The inverted index is a classical IR structure [5, 67], lying at the heart of most modern Web search 
engines and applications handling natural-language text collections. By "natural language" texts 
one refers to those that can be easily split into a sequence of words, and where queries are also
limited to words or sequences thereof (phrases). An inverted index is an array of lists. Each array 
entry corresponds to a different word of the collection, and its list points to the documents where 
that word appears. The set of different words is called the vocabulary. Compared to the document 
retrieval problem for general strings described above, the restriction of word queries allows inverted 
indexes to precompute the answer to each possible word query. 

Two main variants of inverted indexes exist [4, 69]. Ranked retrieval is aimed at retrieving
documents that are most "relevant" to a query, under some criterion. As explained, a popular
relevance formula is $w_{d,q} = tf_{d,q} \times idf_q$, but others built on tf and df, as well as even more complex
ones, have been used. In inverted indexes for ranked retrieval, the lists point to the documents where
each word appears, storing also the weight of the word in that document (in the case of tf × idf,
only tf values are stored, since idf depends only on the word and is stored with the vocabulary).
IR queries are usually formed by various words, so the relevance of the documents is obtained by 
some form of combination of the various individual weights. Algorithms for this type of query have 
been intensively studied, as well as different data organizations for this particular task [57, 67, 69, 
1, 65]. List entries are usually sorted by descending weights of the term in the documents. 

Ranked retrieval algorithms try to avoid scanning all the involved inverted lists. A typical 
scheme is Persin's [57]. It first retrieves the shortest list (i.e., with highest idf), which becomes 
the candidate set, and then considers progressively longer lists. Only a prefix of the subsequent 
lists is considered, where the weights are above a threshold. Those documents are merged with 
the candidate set, accumulating relevance values for the documents that contain both terms. The 
longer the list, the less relevant is the term (as the tfs are multiplied by a lower idf), and thus
the shorter the considered prefix of its list. The threshold provides a time/quality tradeoff. 

The second variant is the inverted indexes for so-called full-text retrieval (also known as boolean 
retrieval). These simply find all the documents where the query appears. In this case the lists 
point to the documents where each term appears, usually in increasing document order. Queries 
can be single words, in which case the retrieval consists simply of fetching the list of the word; or 
disjunctive queries, where one has to fetch the sorted lists of all the query words and merge them; 
or conjunctive queries, where one has to intersect the lists. Intersection queries are nowadays more 
popular, as this is Google's default policy to treat queries of several words. Another important 
query where intersection is essential is the phrase query, where intersecting the documents where 
the words appear is the first step. 

While intersection can be achieved by scanning all the lists in synchronization, faster approaches 
aim to exploit the phenomenon that some lists are much shorter than others [68]. This general
idea is particularly important when the lists for many terms need to be intersected. The amount 
of recent research on intersection of inverted lists witnesses the importance of the problem [26, 
8,3,6,10,63,24,9] (see Barbay et al. [11] for a comprehensive survey). In particular, in-memory 
algorithms have received much attention lately, as large main memories and distributed systems 
make it feasible to hold the inverted index entirely in RAM. 
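One simple way to exploit the length disparity is to search each element of the shorter list in the longer one instead of scanning both in synchronization. The Python sketch below is only one illustrative instance of this idea, not a specific algorithm from the cited literature; the function names are assumptions, and the lists are assumed to be sorted by increasing document identifier.

```python
from bisect import bisect_left


def intersect(short, long):
    """Intersect two increasing lists by binary-searching the shorter one's elements in the longer."""
    if len(short) > len(long):
        short, long = long, short
    out, lo = [], 0
    for x in short:
        lo = bisect_left(long, x, lo)     # resume from the previous position, never look back
        if lo == len(long):
            break
        if long[lo] == x:
            out.append(x)
    return out


def intersect_all(lists):
    """Intersect several lists, starting from the shortest so the candidate set stays small."""
    lists = sorted(lists, key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


print(intersect([3, 9, 20], [1, 3, 4, 9, 10, 11, 21]))               # [3, 9]
print(intersect_all([[1, 2, 3, 4, 5, 6], [2, 4, 6], [4, 6, 8, 10]]))  # [4, 6]
```

With lists of lengths s ≤ t this performs O(s log t) comparisons, which beats the O(s + t) synchronized scan when s is much smaller than t.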

Needless to say, space is an issue in inverted indexes, especially when combined with the goal of 
operating in main memory. Much research has been carried out on compressing inverted lists [67, 
52, 69, 24], and on the interaction of compression with query algorithms, including list intersections. 
Most of the list compression algorithms for full-text indexes rely on the fact that the document 
identifiers are increasing, and that the differences between consecutive entries are smaller on the 
longer lists. The differences are thus represented with encodings that favor small numbers [67]. 






Random access is supported by storing sampled absolute values. For lists sorted by decreasing 
weights, these techniques can still be adapted: most documents in a list have small weight values, 
and within the same weight one can still sort the documents by increasing identifier. 
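The differential representation of an increasing list can be sketched as follows. Variable-byte coding is used here merely as one example of an encoding that favors small numbers; the text does not single out a particular code, and the function names are assumptions of this illustration. Sampling of absolute values for random access is omitted.

```python
def encode_gaps(doc_ids):
    """Variable-byte encode the gaps of a strictly increasing list of positive identifiers."""
    out, prev = bytearray(), 0
    for d in doc_ids:
        gap, prev = d - prev, d
        while gap >= 128:
            out.append(gap & 0x7F)        # low 7 bits; a clear high bit means "more bytes follow"
            gap >>= 7
        out.append(gap | 0x80)            # a set high bit marks the last byte of this gap
    return bytes(out)


def decode_gaps(data):
    """Inverse of encode_gaps: rebuild the absolute identifiers by accumulating the gaps."""
    ids, cur, gap, shift = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:
            cur += gap | ((byte & 0x7F) << shift)
            ids.append(cur)
            gap, shift = 0, 0
        else:
            gap |= byte << shift
            shift += 7
    return ids


ids = [5, 8, 300, 301]
code = encode_gaps(ids)
print(len(code), decode_gaps(code))  # 5 [5, 8, 300, 301]
```

The gaps 5, 3 and 1 take one byte each and the gap 292 takes two, so longer lists, whose gaps tend to be smaller, compress better per entry.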

A serious problem of the current state of the art is that an IR system usually must support both 
types of retrieval: ranked and full-text. For example, this is necessary in order to provide ranked 
retrieval on phrases. Yet, to maintain reasonable space efficiency, the list must be ordered either 
by decreasing weights or by increasing document number, but not both. Hence one type of search 
will be significantly slower than the other, if affordable at all. 

In Section 6 we show that wavelet trees allow one to build a data structure that permits, within 
the same space required for a single compressed inverted index, retrieving the list of documents 
of any term in either decreasing-weight or increasing-identifier order, thus supporting both types 
of retrieval. Moreover, we can efficiently support the operations needed to implement any of the 
intersection algorithms, namely: retrieve the ith element of a list, retrieve the first element larger 
than x, retrieve the next element, and several more complex ones. In addition, our structure offers 
novel ways of carrying out several operations of interest. These include, among others, the support 
for stemming and for structured document retrieval without any extra space cost. 

4 New Algorithms 
4.1 Range Quantile 

Two naive ways of solving query range_quantile(i, j, k) are by sequentially scanning the range in
time $O(j - i + 1)$ [13], and storing the answers to the $O(n^3)$ possible queries in a table and returning
answers in $O(1)$ time. Neither of these solutions is really satisfactory.
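Both naive baselines are easy to write down, which makes them convenient as reference implementations when testing faster structures. In the Python sketch below the function names are assumptions, and the scan sorts a copy of the range (O(ℓ log ℓ) for a range of length ℓ) rather than using a linear-time selection algorithm as in [13]; positions are 1-based.

```python
def range_quantile_scan(S, i, j, k):
    """k-th smallest value of S[i..j] (inclusive), obtained by sorting a copy of the range."""
    return sorted(S[i - 1:j])[k - 1]


def build_table(S):
    """Precompute the answer to every query (i, j, k) with 1 <= i <= j <= n and k <= j - i + 1."""
    n, table = len(S), {}
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            for k, value in enumerate(sorted(S[i - 1:j]), start=1):
                table[(i, j, k)] = value
    return table


S = [3, 1, 4, 1, 5, 9, 2, 6]
table = build_table(S)
print(range_quantile_scan(S, 2, 6, 3), table[(2, 6, 3)], len(table))  # 4 4 120
```

For n = 8 the table already holds n(n+1)(n+2)/6 = 120 entries, which illustrates why the cubic-space option is impractical.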

Until recently there was no work on range quantile queries, but several authors wrote about
range median queries, the special case in which k is half the length of the interval between i and j.
Krizanc et al. [41] introduced the problem of preprocessing for range median queries and gave four
solutions, three of which require time superlogarithmic in n. Their fourth solution requires almost
quadratic space, storing $O(n^2 \log \log n / \log n)$ words to answer queries in constant time (a word holds
$\log \sigma$ bits). Bose et al. [15] considered approximate queries, and Har-Peled and Muthukrishnan [35]
and Gfeller and Sanders [33] considered batched queries. Recently, Krizanc et al.'s fourth solution
was superseded by one due to Petersen and Grabowski [58, 59], who slightly reduced the space
bound to $O(n^2 (\log \log n)^2 / \log^2 n)$ words.



At about the same time we presented the early version of our work [32], Gfeller and Sanders [33] 
gave a similar O(n)-word data structure that supports range median queries in O(log n) time and 
observed in a footnote that "a generalization to arbitrary ranks will be straightforward". A few 
months later, Brodal and Jørgensen [18] gave a more involved data structure that still takes O(n) 
words but only O(log n / log log n) time for queries. These two papers have now been merged [17]. 
Very recently, Jørgensen and Larsen [39] proved a matching lower bound for any data structure 
that takes n log^{O(1)} n space. 

In the sequel we show that, if S is represented using a wavelet tree, we can answer general range 
quantile queries in O(log u) time, where u ≤ min(σ, n) is the number of distinct symbols in S. As 
explained in Section 2, within these n log σ + O(n) bits of space we can also retrieve any element 
S[i] in time O(log n), so our data structure actually replaces S (requiring only O(n) extra bits). The 
latest alternative structure [39] may achieve slightly better time but it requires O(n log n) extra 
bits of space, apart from being significantly more involved. 
Algorithm 3 New wavelet tree algorithms: rqq(v_root, i, j, k) returns (range_quantile(S, i, j, k), f) 
on the wavelet tree of sequence S, assuming k ≤ j - i + 1, and where f is the frequency of the 
returned element in S[i, j]; rnv(v_root, i, j, 0, x) returns (range_next_value(S, i, j, x), f, p), where f 
is the frequency and p is the smallest rank of the returned element in the multiset S[i, j] (the 
element is ⊥ if no answer exists); and rint(v_root, i_1, j_1, i_2, j_2, [y_s, y_e]) solves an extension of query 
range_intersect(S, i_1, j_1, i_2, j_2) outputting triples (y, f_1, f_2), where y are the common elements, f_1 
is their frequency in S[i_1, j_1], and f_2 is their frequency in S[i_2, j_2], and moreover y_s ≤ y ≤ y_e. 



rqq(v, i, j, k) 
  if v is a leaf then 
    return (label(v), j - i + 1) 
  else 
    i_l ← rank_0(B_v, i - 1) + 1 
    j_l ← rank_0(B_v, j) 
    i_r ← i - i_l + 1, j_r ← j - j_l 
    n_l ← j_l - i_l + 1 
    if k ≤ n_l then 
      return rqq(v_l, i_l, j_l, k) 
    else 
      return rqq(v_r, i_r, j_r, k - n_l) 
    end if 
  end if 



rnv(v, i, j, p, x) 
  if i > j then 
    return (⊥, 0, 0) 
  else if v is a leaf then 
    return (x, j - i + 1, p) 
  else 
    i_l ← rank_0(B_v, i - 1) + 1 
    j_l ← rank_0(B_v, j) 
    i_r ← i - i_l + 1, j_r ← j - j_l 
    n_l ← j_l - i_l + 1 
    if x ∈ labels(v_r) then 
      return rnv(v_r, i_r, j_r, p + n_l, x) 
    else 
      (y, f, p') ← rnv(v_l, i_l, j_l, p, x) 
      if y ≠ ⊥ then 
        return (y, f, p') 
      else 
        return rnv(v_r, i_r, j_r, p + n_l, min labels(v_r)) 
      end if 
    end if 
  end if 



rint(v, i_1, j_1, i_2, j_2, rng) 
  if i_1 > j_1 ∨ i_2 > j_2 then 
    return 
  else if labels(v) ∩ rng = ∅ then 
    return 
  else if v is a leaf then 
    output (label(v), j_1 - i_1 + 1, j_2 - i_2 + 1) 
  else 
    i_1l ← rank_0(B_v, i_1 - 1) + 1 
    j_1l ← rank_0(B_v, j_1) 
    i_1r ← i_1 - i_1l + 1, j_1r ← j_1 - j_1l 
    i_2l ← rank_0(B_v, i_2 - 1) + 1 
    j_2l ← rank_0(B_v, j_2) 
    i_2r ← i_2 - i_2l + 1, j_2r ← j_2 - j_2l 
    rint(v_l, i_1l, j_1l, i_2l, j_2l, rng) 
    rint(v_r, i_1r, j_1r, i_2r, j_2r, rng) 
  end if 



Theorem 1 Given a sequence S[1, n] storing u distinct values over alphabet [1, σ], we can represent 
S within n log σ + O(n) bits, so that range quantile queries are solved in time O(log u). Within that 
time we can also know the number of times the returned value appears in the range. 

Proof. We represent S using a wavelet tree T, as in Lemma 1. Query range_quantile(i, j, k) is 
then solved as follows. We start at the root of T and consider its bitmap B_root. We compute 
n_l = rank_0(B_root, j) - rank_0(B_root, i - 1), the number of 0s in B_root[i, j]. If n_l ≥ k, then there 
are at least k symbols in S[i, j] that label leaves descending from the left child T_l of T, and thus 
we must find the kth symbol on T_l. Therefore we continue recursively on T_l with the new values 
i ← rank_0(B_root, i - 1) + 1, j ← rank_0(B_root, j), and k unchanged. Otherwise, we must descend 
to the right child, mapping the range to i ← rank_1(B_root, i - 1) + 1 and j ← rank_1(B_root, j). In 
this case, since we have discarded n_l numbers that are already to the left of the kth value, we set 
k ← k - n_l. When we reach a leaf, we just return its label. Furthermore, we have that the value 
occurs j - i + 1 times in the original range. Since T is balanced and we spend constant time at each 
node as we descend, our search takes O(log u) time. □ 

Algorithm 3 (left) gives pseudocode. Note that, if u is constant, then so is our query time. 
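
The descent used in the proof can be sketched in Python as follows. This is only an illustration under stated assumptions, not the succinct structure of Lemma 1: the class name Node and its zeros array are hypothetical, each bitmap B_v is replaced by a plain prefix-count list so that rank_0 becomes a lookup, and the alphabet is split at the midpoint of its value interval rather than over the u symbols that actually occur.

```python
class Node:
    """Pointer-based wavelet tree node covering the symbol interval [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        # zeros[p] = rank_0(B_v, p): symbols <= mid among the first p of seq
        self.zeros = [0]
        for c in seq:
            self.zeros.append(self.zeros[-1] + (c <= mid))
        self.left = Node([c for c in seq if c <= mid], lo, mid)
        self.right = Node([c for c in seq if c > mid], mid + 1, hi)


def rqq(v, i, j, k):
    """Return (k-th smallest symbol of S[i..j], its frequency in S[i..j]).

    Positions are 1-based and 1 <= k <= j - i + 1 is assumed.
    """
    while v.lo != v.hi:
        il = v.zeros[i - 1] + 1          # rank_0(B_v, i - 1) + 1
        jl = v.zeros[j]                  # rank_0(B_v, j)
        nl = jl - il + 1                 # number of 0s in B_v[i..j]
        if k <= nl:
            v, i, j = v.left, il, jl
        else:
            v, i, j, k = v.right, i - il + 1, j - jl, k - nl
    return v.lo, j - i + 1


S = [3, 1, 4, 1, 5, 9, 2, 6]
root = Node(S, 1, 9)
median = rqq(root, 2, 6, 3)   # third smallest of S[2..6] = [1, 4, 1, 5, 9]
```

Here the right-child range is obtained from the left-child one, using rank_1(B_v, p) = p - rank_0(B_v, p), so a single prefix-count list per node suffices.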
4.2 Range Next Value 

Again, two naive ways of solving query range_next_value(i, j, x) on sequence S[1, n] are scanning in 
O(j - i + 1) worst-case time, and precomputing all the possible answers in O(n^3) space to achieve 
constant time queries. Crochemore et al. [23] reduced the space to O(n^2) words while preserving the 
constant query time. Later, Crochemore et al. [22] further improved the space to O(n^{1+ε}) words. 
Alternatively, Mäkinen et al. [46, Lemma 4] give a simple O(n)-words space solution based on an 
augmented binary search tree. This yields time O(log u), where once again u ≤ min(n, σ) is the 
number of distinct symbols in S and [1, σ] the domain of values. For the particular case of 
semi-infinite queries (i.e., i = 1 or j = n) one can use an O(n)-words and O(log log n) time solution by 
Gabow et al. [30]. 

By using wavelet trees, we also solve the general problem in time O(log u). Our space is better 
than the simple linear-space solution, n + O(n / log σ) words (n of which actually replace the 
sequence). 

Theorem 2 Given a sequence S[1, n] storing u distinct values over alphabet [1, σ], we can represent 
S within n log σ + O(n) bits, so that range next value queries are solved in time O(log u). Within 
the same time we can return the position of the first occurrence of the value in the range. 

Proof. We represent S using a wavelet tree T, as in Lemma 1. Query range_next_value(i, j, x) is 
then solved as follows. We start at the root of T and consider its bitmap B_root. If x labels a leaf 
descending by the right child T_r, then the left subtree is irrelevant and we continue recursively on 
T_r, with the new values i ← rank_1(B_root, i - 1) + 1 and j ← rank_1(B_root, j). Otherwise, we must 
descend to the left child T_l, mapping the range to i ← rank_0(B_root, i - 1) + 1 and j ← rank_0(B_root, j). 
If our interval [i, j] becomes empty at any point, we return with no value. 

When the recursion returns from T_r with no value, we return no value as well. When it returns 
from T_l with no value, however, there is still a chance that a number > x appears on the right in 
the interval. Indeed, if we descend to T_r and map i and j accordingly, and the interval is not 
empty, then we want the minimum value of that interval, that is, the minimum value in the mapped 
interval [i, j] of T_r. This is a particular case of a range_quantile query carried out on a wavelet 
(sub)tree T_r. The overall time is O(log u). □ 

Algorithm 3 (middle) gives pseudocode. While our space gain may not appear very impressive, 
we point out that our solution requires only O(n) extra bits on top of the sequence (if we accept 
the logarithmic slowdown in accessing S via the wavelet tree). Moreover, we can use the same 
wavelet tree to carry out the other algorithms, instead of requiring a different data structure for 
each. This will be relevant for the applications, which need support for several of the operations 
simultaneously. 
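
As an illustration of the procedure in the proof, the following Python sketch returns the smallest symbol ≥ x in S[i, j] together with its frequency. It is a simplification under stated assumptions: the Node class and its zeros prefix counts are hypothetical stand-ins for the bitmaps B_v, the rank p returned by rnv in Algorithm 3 is omitted, and the minimum of the right subtree is found by calling the same routine with x set to that subtree's smallest label instead of a range quantile query.

```python
class Node:
    """Pointer-based wavelet tree node covering the symbol interval [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        self.zeros = [0]                 # zeros[p] = rank_0(B_v, p)
        for c in seq:
            self.zeros.append(self.zeros[-1] + (c <= mid))
        self.left = Node([c for c in seq if c <= mid], lo, mid)
        self.right = Node([c for c in seq if c > mid], mid + 1, hi)


def rnv(v, i, j, x):
    """Return (smallest symbol >= x in S[i..j], its frequency), or None."""
    if i > j or x > v.hi:
        return None                      # empty range, or x beyond all labels
    if v.lo == v.hi:
        return v.lo, j - i + 1
    il, jl = v.zeros[i - 1] + 1, v.zeros[j]
    ir, jr = i - il + 1, j - jl          # same range mapped with rank_1
    if x > v.left.hi:                    # x labels a leaf of the right child
        return rnv(v.right, ir, jr, x)
    res = rnv(v.left, il, jl, x)
    if res is not None:
        return res
    # nothing >= x on the left: take the minimum of the right range
    return rnv(v.right, ir, jr, v.right.lo)


S = [3, 1, 4, 1, 5, 9, 2, 6]
root = Node(S, 1, 9)
answer = rnv(root, 2, 6, 2)   # smallest value >= 2 in S[2..6] = [1, 4, 1, 5, 9]
```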

4.3 Range Intersection 

The query range_intersect(i_1, j_1, i_2, j_2), which finds the common symbols in two ranges of a sequence 
S[1, n] over alphabet [1, σ], appears naturally in many cases. In particular, a simplified variant where 
the two ranges to intersect are sorted in increasing order arises when intersecting full-text inverted 
lists, when solving intersection, phrase, or proximity queries. 
Worst-case complexity measures depending only on the range sizes are of little interest for this 
problem, as an adversary can always force us to completely traverse both ranges, and time 
complexity O(j_1 - i_1 + j_2 - i_2 + 1) is easily achieved through merging (footnote 9). More interesting are 
adaptive complexity measures, which define a finer difficulty measure for problem instances. For example, in 
the case of sorted ranges, an instance where the first element of the second range is larger than the 
last element of the first range is easier (one can establish the emptiness of the result with just one 
well-chosen comparison) than another where elements are mixed. 

A popular measure for this case is called alternation and denoted α [8]. For two sorted sequences 
without repetitions, α can be defined as the number of switches from one sequence to the other in 
the sorted union of the two ranges, or equivalently, as the time complexity of a nondeterministic 
program that guesses which comparisons to carry out, or equivalently as the length of a certificate 
that, through the results of comparing elements of both sequences, is sufficient to prove what the 
result is. This definition can be extended to intersecting k ranges. Formally, the measure α is defined 
through a function C : [1, σ] → [0, k], where C[c] gives the number of any range where symbol c 
does not appear, and C[c] = 0 if c appears in all ranges. Then α is the number of zeros in C plus 
the minimum possible number of switches (i.e., C[c] ≠ C[c + 1]) in such a function. A lower bound 
in terms of alternation (still holding for randomized algorithms) [8] is Ω(α · Σ_{1≤r≤k} log(n_r/α)), where 
n_r is the length of the rth range. There exist adaptive algorithms matching this lower bound [26, 
8,9]. 

We show now that the wavelet tree representation of S[1, n] allows a rather simple intersection 
algorithm that approaches the lower bound, even if one starts from ranges of disordered values, 
possibly with repetitions. For k = 2, we start from both ranges [i_1, j_1] and [i_2, j_2] at the root of the 
wavelet tree. If either range is empty, we stop. Otherwise we map both ranges to the left child of the 
root using rank_0, and to the right child using rank_1. We continue recursively on the branches where 
both intervals are nonempty. If we reach a leaf, then its corresponding symbol is in the intersection, 
and we know that there are j_1 - i_1 + 1 copies of the symbol in the first range, and j_2 - i_2 + 1 in 
the second. For k ranges [i_r, j_r], we maintain them all at each step, and abandon a path as soon as 
any of the k ranges becomes empty. Algorithm 3 (right) gives pseudocode for the case k = 2. 
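
For k = 2 the traversal just described can be sketched in Python as below. The sketch rests on assumptions that are not part of the paper: Node and its zeros prefix counts are hypothetical stand-ins for the wavelet tree bitmaps, the restriction to a label range rng of Algorithm 3 is left out, and results are appended to a list instead of being output.

```python
class Node:
    """Pointer-based wavelet tree node covering the symbol interval [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        self.zeros = [0]                 # zeros[p] = rank_0(B_v, p)
        for c in seq:
            self.zeros.append(self.zeros[-1] + (c <= mid))
        self.left = Node([c for c in seq if c <= mid], lo, mid)
        self.right = Node([c for c in seq if c > mid], mid + 1, hi)


def rint(v, i1, j1, i2, j2, out):
    """Append (symbol, freq in S[i1..j1], freq in S[i2..j2]) for common symbols."""
    if i1 > j1 or i2 > j2:
        return                           # abandon the path: some range is empty
    if v.lo == v.hi:
        out.append((v.lo, j1 - i1 + 1, j2 - i2 + 1))
        return
    i1l, j1l = v.zeros[i1 - 1] + 1, v.zeros[j1]
    i2l, j2l = v.zeros[i2 - 1] + 1, v.zeros[j2]
    rint(v.left, i1l, j1l, i2l, j2l, out)
    rint(v.right, i1 - i1l + 1, j1 - j1l, i2 - i2l + 1, j2 - j2l, out)


S = [3, 1, 4, 1, 5, 9, 2, 6]
root = Node(S, 1, 9)
common = []
rint(root, 1, 5, 3, 8, common)   # S[1..5] = [3,1,4,1,5], S[3..8] = [4,1,5,9,2,6]
```

Because the left child is always visited first, the common symbols come out in increasing order.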

Lemma 6. The algorithm just described requires time O(α k log(u/α)), where u is the number of 
distinct values in the sequence and α is the alternation complexity of the problem. 

Proof. Consider the function p : Σ → {0, 1}*, so that p(c) is a bit stream of length equal to the 
depth of the leaf representing symbol c in the wavelet tree. More precisely, the ith bit of p(c) is 0 if 
the leaf descends from the left child of its ancestor at depth i, and 1 otherwise. That is, p(c) describes 
the path from the root to the wavelet tree leaf labeled c. 

Now let T_r be the trie (or digital tree) formed by the strings p(c) for all those c appearing in 
S[i_r, j_r], and let T_∩ be the trie formed by the branches present in all T_r, 1 ≤ r ≤ k. It is easy to see 
that T_∩ contains precisely the wavelet tree nodes traversed by our intersection algorithm, so the 
complexity of our algorithm is O(|T_∩|). 

We show now that T_∩ has at most α leaves. The leaves of T_∩ that are wavelet tree leaves 
correspond to the symbols that belong to the intersection, and thus to the number of 0s in any 
function C. This is accounted for in measure α. So let us focus on the other leaves of T_∩. Consider 
two consecutive leaves u_1 and u_2 of T_∩ that are not wavelet tree leaves, and any symbols c_1 < c_2 
whose wavelet tree leaves v_1 and v_2 descend from u_1 and u_2, respectively. If there were a single 
range S[i_r, j_r] where c_1 and c_2 would not belong, then the lowest common ancestor of v_1 and v_2 
would not belong to T_∩, and thus there could not be two leaves u_1 and u_2 in T_∩. Therefore, for 
each pair of consecutive leaves in T_∩ there is at least one switch in C, and thus there are at most 
α leaves in T_∩. Thus, by Lemma 3, T_∩ has O(α log(u/α)) nodes. To obtain the final cost we multiply 
by k, which is the cost of maintaining the k ranges throughout the traversal. □ 

9 If the ranges are already ordered; otherwise a previous sorting is necessary. 

In the case where all the lists are sorted and without repetitions (so n_r ≤ u), our algorithm 
complexity is pretty close to the lower bound, matched when all n_r = u. Note also that our algorithm 
is easily extended to handle the so-called (t, k)-thresholded problem [8], where we return any symbol 
appearing in at least t of the k ranges. It is simply a matter of abandoning a path only when more 
than k - t ranges have become empty. 

A different form of carrying out the intersection is via the query range_next_value(S, i, j, x): 
Start with x_1 ← range_next_value(S, i_1, j_1, 1) and x_2 ← range_next_value(S, i_2, j_2, x_1). If x_2 > x_1 
then continue with x_1 ← range_next_value(S, i_1, j_1, x_2); if now x_1 > x_2 then continue with x_2 ← 
range_next_value(S, i_2, j_2, x_1); and so on. If at any moment x_1 = x_2 then output it as part of the 
intersection and continue with x_1 ← range_next_value(S, i_1, j_1, x_2 + 1). It is not hard to see that 
there must be a switch in C for each step we carry out, and therefore the cost is O(α log n). 
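
The alternation between the two ranges can be sketched as a short Python loop. To keep the focus on the control flow, range_next_value is implemented here by a naive scan, which is an assumption of this sketch and not the wavelet tree query; the function name intersect_by_rnv is hypothetical, and symbols are assumed to be integers ≥ 1.

```python
def range_next_value(S, i, j, x):
    """Naive stand-in: smallest value >= x in S[i..j] (1-based), or None."""
    cands = [c for c in S[i - 1:j] if c >= x]
    return min(cands) if cands else None


def intersect_by_rnv(S, i1, j1, i2, j2):
    """Distinct symbols common to S[i1..j1] and S[i2..j2], in increasing order."""
    out = []
    x1 = range_next_value(S, i1, j1, 1)
    while x1 is not None:
        x2 = range_next_value(S, i2, j2, x1)
        if x2 is None:
            break                        # second range has nothing >= x1
        if x2 == x1:
            out.append(x1)               # common symbol found
            x1 = range_next_value(S, i1, j1, x1 + 1)
        else:
            x1 = range_next_value(S, i1, j1, x2)   # leap over the gap
    return out


S = [3, 1, 4, 1, 5, 9, 2, 6]
common = intersect_by_rnv(S, 1, 5, 3, 8)
```

Each iteration either reports a symbol or skips every candidate between x_1 and x_2, which is what ties the number of iterations to the number of switches in C.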

To reduce the cost to O(α log(u/α)), we carry out a fingered search in range_next_value queries, 
that is, we remember the path traversed from the last time we called range_next_value(S, i, j, x) 
and only retraverse the necessary part upon calling range_next_value(S, i, j, x') for x' > x. For this 
reason we move upwards from the leaf where the query for x was solved until reaching the first 
node v such that x' ∈ labels(v), and complete the rnv procedure from that node. Since the total 
work done by this point is proportional to the number of distinct ancestors of the α leaves arrived 
at, the complexity is O(α log(u/α)) by Lemma 3. 

This second procedure is the basis of most algorithms for intersecting two or more lists [11]. 
The rint method we have presented is simpler, potentially faster, and more flexible (e.g., it is easily 
adapted to t-thresholded queries). Moreover, it is specific to the wavelet tree. 

5 Document Listing and Intersections 

The algorithm for range_report(P, x_s, x_e, y_s, y_e) queries described in Section 2 can be used to solve 
problem doc_listing(q), as follows. As explained in Section 3.1, use a (compressed) suffix array A 
to find the range A[sp, ep] corresponding to query q, and use a wavelet tree on the document array 
D[1, n] on alphabet [1, m], so that the answer is the set of distinct document numbers d_1 < d_2 < 
... < d_docc in D[sp, ep]. Then range_report(D, sp, ep, 1, m) returns the docc document numbers, 
in order, in total time O(docc log(m/docc)). Moreover, procedure report in Algorithm 2 also retrieves 
the frequencies of each d_i in D[sp, ep], outputting the pairs (d_i, tf_{q,d_i}) within the same cost. (As 
explained, arbitrary frequencies tf_{q,d} = doc_frequency(q, d) can also be obtained in time O(log m) 
by two rank_d queries on D.) Alternative solutions using range_quantile or range_next_value queries 
are possible, and will be explored later for other applications. 
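
A minimal Python sketch of this document listing follows. It assumes the suffix array range [sp, ep] has already been computed and uses a hypothetical pointer-based Node class with prefix counts in place of the bitmaps; the function name doc_listing and the toy document array are illustrative only, and the traversal is a simplified version of procedure report with rng = [1, m].

```python
class Node:
    """Pointer-based wavelet tree node covering the symbol interval [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        self.zeros = [0]                 # zeros[p] = rank_0(B_v, p)
        for c in seq:
            self.zeros.append(self.zeros[-1] + (c <= mid))
        self.left = Node([c for c in seq if c <= mid], lo, mid)
        self.right = Node([c for c in seq if c > mid], mid + 1, hi)


def doc_listing(v, sp, ep, out):
    """Append (document, frequency) for each distinct document in D[sp..ep]."""
    if sp > ep:
        return                           # no occurrences below this node
    if v.lo == v.hi:
        out.append((v.lo, ep - sp + 1))  # leaf: range length is the frequency
        return
    l, r = v.zeros[sp - 1] + 1, v.zeros[ep]
    doc_listing(v.left, l, r, out)
    doc_listing(v.right, sp - l + 1, ep - r, out)


D = [2, 1, 2, 3, 1, 2, 4, 2]             # toy document array, m = 4
root = Node(D, 1, 4)
pairs = []
doc_listing(root, 3, 6, pairs)           # D[3..6] = [2, 3, 1, 2]
```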

As explained in Section 3.1, this is simpler and requires less space than various previous 
solutions (footnote 10), and has the additional benefit of delivering the documents in increasing document 
identifier order. This enables us to extend the algorithm to more complex scenarios, as shown in Section 6. 

10 It is even better than our previous solution based on range_quantile queries [32], which takes time O(docc log m). 
Now consider k queries q_1, q_2, ..., q_k, and the problem of listing the documents where all those 
queries appear (i.e., problem doc_intersect(q_1, ..., q_k)). With the suffix array we can map the queries 
to ranges [sp_r, ep_r], and then the problem is that of finding the distinct document numbers that 
appear in all those ranges. This corresponds exactly to query range_intersect(D, sp_1, ep_1, ..., sp_k, ep_k), 
which we have solved in Section 4.3. We have indeed solved a more general variant where we list the 
documents (and their tf_{q_r,d} values) where at least t of the k terms appear. Note this corresponds 
to the disjunctive query for the case t = 1. 

5.1 Temporal and Hierarchical Documents 

The simplest extension when we have versioned or hierarchical documents is to restrict queries 
doc_listing(q) and doc_intersect(q_1, ..., q_k) to a range of documents [d_min, d_max], which represents 
a temporal interval or a subtree of the hierarchy in which we are interested. Such a restricted 
document listing and intersection is easily supported by setting rng = [d_min, d_max] in procedures report 
(Algorithm 2) and rint (Algorithm 3), respectively. The complexities are O(docc log((d_max - d_min + 1)/docc)) 
for listing and O(α log((d_max - d_min + 1)/α)) for intersections, due to Lemma 5. 

When the hierarchical documents represent nodes in an XML collection, other queries of interest 
become obvious. Indeed, how to carry out ranking on XML collections is an unresolved issue, with 
very complex ranking proposals counterweighted by others advocating simple measures. Rather 
than trying to cover such a broad topic, we refer the reader to comprehensive surveys and discussions 
in the article by Hiemstra and Mihajlovic [36], the PhD thesis of Pehcevski [56, Ch. 2], and the 
recent book by Lalmas [42, Ch. 6]. 

In most models, the frequency of a term within a subtree, and the size of such subtree, are central 
to the definition of ranking strategies. The latter is usually easy to compute from the sequence 
representation. The former, a generalization of doc -frequency to ranges, can actually be computed 
with query range_count(D, sp,ep,dl,dr), defined in the Introduction (see also Algorithm 2), where 
[sp, ep] is the suffix array range corresponding to query q, and [dl,dr] is the range of documents 
corresponding to our structural element. This query also takes time O(log m). 

5.2 Restricting to Retrievable Units 

We focus now on a more complex issue that is also essential for XML ranked retrieval. Query 
languages such as XPath and XQuery define structural constraints together with terms of interest. 
For example, one might wish to retrieve books about the term "cryptography", or rather book 
sections about that term, in each case ranked by the relevance of the term. Thus the definition of 
the retrievable unit (books, sections) comes in the query together with the terms (cryptography) 
whose relevance is to be computed with respect to the retrievable units that contain it. We show 
now how to support a simple model where the retrievable units are defined by an XML tag name, 
and consider other models at the end. We report the smallest retrievable unit containing the query 
occurrences. 

Following common models of XML data (e.g., [2]), we consider that text data can appear only 
at the leaves of the XML structure, so that we create extra leaves if text data appears between 
consecutive structural elements (a bitmap may be used to mark leaves that do not contain any text 
data, but we omit this detail here for clarity). Thus, each leaf of the XML tree will be associated 
with a document number, 1 to m, so that d_i will be the document associated to the ith leaf. The 
XML tree, containing n nodes, will be represented using a sequence P[1, 2n] of parentheses [38]. 
These are obtained through a preorder traversal, by appending an opening parenthesis when we 
reach a node and a closing one when we leave it. A tree node will be identified with the position of 
its opening parenthesis in P. Several succinct data structures can represent the parentheses within 
2n + o(n) bits and simulate a wealth of tree operations in constant time (e.g., [62]). 

In addition we represent a sequence Tag[1, 2n] giving the tag name associated to each parenthesis 
in P. Sequence Tag is represented using a wavelet tree in 2n log r + O(n) bits of space, where r 
is the number of distinct tags in the collection. Finally, for each distinct tag name t we store a 
parenthesis representation P_t of the nodes of the XML tree that are tagged t. The total space for 
P, Tag, and the P_t trees is 2n log r + O(n) bits. 

A first task we can carry out is, given an occurrence in document number (i.e., leaf) i, find 
expand(t, i), the range of documents (i.e., leaves) corresponding to its lowest ancestor tagged t. This 
allows us to find the closest retrievable unit to which the occurrence at leaf i must be assigned. 
We use operation j = selectLeaf(P, i) to find the ith leaf of P. Then r = rank_t(Tag, j) finds the 
rank of the last occurrence of t in Tag preceding j. If P_t[r] = '(', then r is the lowest ancestor of 
i tagged t; otherwise it is r ← parent(P_t, r), the node tagged t that encloses position r. Finally, 
position r is mapped back to the global tree P with p = select_t(Tag, r), and we return the range 
of leaves corresponding to p, expand(t, i) = leaf_range(p) = [rankLeaf(P, p) + 1, rankLeaf(P, p + 2 · 
subtreeSize(P, p))], where rankLeaf and subtreeSize are self-explanatory tree operations. The process 
takes O(log r) time, dominated by the costs to operate on Tag. Algorithm 4 (left) gives pseudocode. 

If we now want to count the number of occurrences of our query q in a retrievable node p, we 
need to count the number of occurrences of the range of leaves (i.e., document numbers) below p 
within the interval D[sp, ep] corresponding to query q. Such a range is easily obtained in constant 
time as [dl, dr] = leaf_range(p). Then the result is range_count(D, sp, ep, dl, dr), as explained. 

To carry out document listing restricted to structural elements tagged t, we build on range 
next value queries. We start with d_1 = range_next_value(D, sp, ep, 1), which gives us the 
smallest (leaf) document number in D[sp, ep]. Now we compute [dl_1, dr_1] = expand(t, d_1), the range 
of the lowest node tagged t that contains d_1. Then we find the next leaf document using d_2 = 
range_next_value(D, sp, ep, dr_1 + 1), and so on. In general, d_{i+1} = range_next_value(D, sp, ep, dr_i + 
1). Algorithm 4 (left) gives pseudocode. The cost per document retrieved is O(log r + log m). 
However, using the fingered search on rnv outlined in Section 4.3, the overall cost reduces to 
O(docc (log r + log(m/docc))). If we wish to additionally restrict the retrieval to documents in the range 
[d_min, d_max], we simply start with d_1 = range_next_value(D, sp, ep, d_min) and stop when we 
retrieve a document larger than d_max. The cost improves to O(docc (log r + log((d_max - d_min + 1)/docc))) due 
to Lemma 5. Complexity returns to O(docc log m) if we compute also the frequency in each 
retrievable unit using hdfreq. 

Finally, to carry out intersections restricted to retrievable units, we follow in principle the 
same algorithm outlined in Section 4.3. The difference is that we must not split retrievable ranges. 
Therefore, when we are at any wavelet tree node, before going to the left child that represents the 
range of symbols [d_a, d_b] and/or to the right child representing range [d_b + 1, d_c], we first find out 
whether there is a retrievable unit covering [d_b, d_b + 1]. To do this we compute expand(t, d_b) = [dl, dr] 
and expand(t, d_b + 1) = [dl', dr']. Since the leaves are consecutive, there are only two possibilities: 
either [dl, dr] = [dl', dr'] or they are disjoint (and dr = d_b and dl' = d_b + 1). In the latter case 
we proceed with the recursion as in Section 4.3. In the former case, we descend to the left child 
with document range restricted to [d_a, dl - 1], then report document [dl, dr] if it belongs to the 
Algorithm 4 Algorithms for hierarchical document listing and intersections: exp(Tag, P, P_t, t, i) 
computes the node in P for expand(t, i) and leafRange(P, p) computes leaf_range(p); 
hdfreq(P, D, sp, ep, p) computes the frequency of p in D[sp, ep]; hdlist(A, D, Tag, P, P_t, t, q) lists 
the retrievable units where q appears; and hdint(Tag, P, P_t, t, v_root, i_1, j_1, i_2, j_2, rng) lists the 
retrievable units with leaves in both D[i_1, j_1] and D[i_2, j_2], with their frequencies in both ranges, and 
subject to belonging to document range rng (which is assumed not to split any retrievable unit). 



exp(Tag, P, P_t, t, i) 
  j ← selectLeaf(P, i) 
  r ← rank(Tag, t, j) 
  if P_t[r] = ')' then 
    r ← parent(P_t, r) 
  end if 
  return select(Tag, t, r) 

leafRange(P, p) 
  return [rankLeaf(P, p) + 1, rankLeaf(P, p + 2 · subtreeSize(P, p))] 

hdfreq(P, D, sp, ep, p) 
  [dl, dr] ← leafRange(P, p) 
  return count(D, sp, ep, [dl, dr]) 

hdlist(A, D, Tag, P, P_t, t, q, [d_min, d_max]) 
  [sp, ep] ← pattern_search(A, q) 
  v ← root(D) 
  (d, f, r) ← rnv(v, sp, ep, 0, d_min) 
  while d ≠ ⊥ ∧ d ≤ d_max do 
    p ← exp(Tag, P, P_t, t, d) 
    output p 
    [dl, dr] ← leafRange(P, p) 
    (d, f, r) ← rnv(v, sp, ep, 0, dr + 1) 
  end while 



hdint(Tag, P, P_t, t, v, i_1, j_1, i_2, j_2, rng) 
  if i_1 > j_1 ∨ i_2 > j_2 ∨ rng = ∅ then 
    return 
  else if v is a leaf then 
    output (label(v), j_1 - i_1 + 1, j_2 - i_2 + 1) 
  else 
    [dl, dr] ← ∅, f_1 ← 0, f_2 ← 0 
    i_1l ← rank_0(B_v, i_1 - 1) + 1, j_1l ← rank_0(B_v, j_1) 
    i_1r ← i_1 - i_1l + 1, j_1r ← j_1 - j_1l 
    i_2l ← rank_0(B_v, i_2 - 1) + 1, j_2l ← rank_0(B_v, j_2) 
    i_2r ← i_2 - i_2l + 1, j_2r ← j_2 - j_2l 
    if i_1l ≤ j_1l ∧ i_2l ≤ j_2l ∧ i_1r ≤ j_1r ∧ i_2r ≤ j_2r ∧ 
       labels(v_l) ∩ rng ≠ ∅ ∧ labels(v_r) ∩ rng ≠ ∅ then 
      p_l ← exp(Tag, P, P_t, t, max labels(v_l)) 
      p_r ← exp(Tag, P, P_t, t, min labels(v_r)) 
      if p_l = p_r then 
        [dl, dr] ← leafRange(P, p_l) 
        f_1 ← count(v, i_1, j_1, [dl, dr]) 
        f_2 ← count(v, i_2, j_2, [dl, dr]) 
      end if 
    end if 
    hdint(Tag, P, P_t, t, v_l, i_1l, j_1l, i_2l, j_2l, (labels(v_l) ∩ rng) - [dl, dr]) 
    if f_1 > 0 ∧ f_2 > 0 then 
      output (p_l, f_1, f_2) 
    end if 
    hdint(Tag, P, P_t, t, v_r, i_1r, j_1r, i_2r, j_2r, (labels(v_r) ∩ rng) - [dl, dr]) 
  end if 



intersection, and finally descend to the right child with document range restricted to [dr + 1, d_c]. 
By "descending with document range restricted to [x,y]" we mean we abandon branches whose 
document range has no intersection with [x,y], and such restrictions are inherited as we descend. 

Algorithm 4 (right) gives pseudocode. The complexity is the same as if the retrievable units were 
materialized into consecutive document numbers, that is, O(α log(m/α)) under this interpretation. The 
only extra cost is the computation of f_1 and f_2. Note, however, that these are computed with a 
range_count query restricted to the local subtree, and thus the cost at height h is O(h). Moreover, 
this is computed only for nodes of trie T_∩ (recall Lemma 6) having two children, that is, at most 
α times. The most expensive case is thus when all those α nodes are as high as possible in the 
wavelet tree, in which case the O(h) costs add up to O(α log(m/α)) and do not affect the complexity. 
Once again, we can restrict the results to a range [d_min, d_max] with the usual time improvement. 

Other possibilities for marking the retrievable documents can be supported, as long as one is 
able to find the lowest retrievable ancestor of any leaf. For example we could mark retrievable 
nodes in a bitmap B[1, 2n] aligned with P, where we set to 1 the opening and closing parentheses 
of retrievable nodes. Then we can compute expand(B, i) via rank and select operations on B in 
constant time as follows. We start with j = selectLeaf(P, i), then p = select_1(B, rank_1(B, j)), then if 
P[p] = ')' we recompute p = parent(P, p), and finally expand(B, i) = leaf_range(p). In cases where 
the retrievable units are defined dynamically, say from previous parts of the query processing, we can 
store them in a balanced tree, so that query select_1(B, rank_1(B, j)) (which is actually a predecessor 
query) can be answered in O(log n) time. 

6 Inverted Lists 

Recall m is the total number of documents in the collection and let ν be the number of different 
terms. Let L_t[1, df_t] be the list of document identifiers where term t appears, in decreasing weight 
order (for concreteness we will assume we store tf values in the lists as weights, but any weight 
will do). Let n = Σ_t df_t be the total number of occurrences of distinct terms in the documents, 
and N = Σ_{t,d} tf_{t,d} the total length, in words, of the text collection (thus m ≤ n ≤ min(mν, N)). 
Finally, let |q| be the number of terms in query q. 

We propose to concatenate all the lists L_t into a unique sequence L[1, n], and store for each 
term t the starting position s_t of list L_t within L. The sequence L of document identifiers is then 
represented with a wavelet tree. 

The tf values themselves are stored in differential and run-length compressed form in a separate 
sequence. More precisely, we mark the ν_t different tf_{t,d} values of each list in a bitmap T_t[1, m_t], where 
m_t = max_d tf_{t,d}, and the ν_t points in L_t[1, df_t] where value tf_{t,d} changes, in a bitmap R_t[1, df_t]. 
Thus one can obtain tf_{t,L_t[i]} = select_1(T_t, ν_t - rank_1(R_t, i) + 1). The s_t sequence is also represented 
using a bitmap S[1, n] providing rank/select operations. Thus we can recover s_t = select_1(S, t), 
and also rank_1(S, i) tells us which list L[i] belongs to. 
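
The following Python sketch illustrates how a tf value is recovered from the two bitmaps of a single list. It is a toy under stated assumptions: rank_1 and select_1 are naive linear scans rather than constant-time succinct operations, the helper names encode_tf and tf_at are hypothetical, and R_t is assumed to have a 1 at the first position of every run of equal tf values, including position 1.

```python
def rank1(B, i):
    """Number of 1s in B[1..i] (1-based positions)."""
    return sum(B[:i])


def select1(B, k):
    """Position (1-based) of the k-th 1 in B."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        seen += bit
        if bit and seen == k:
            return pos
    raise ValueError("fewer than k ones")


def encode_tf(tfs):
    """Build (T_t, R_t) from the tf values of one list, given in decreasing order."""
    T = [0] * tfs[0]                     # m_t = largest tf value
    for value in tfs:
        T[value - 1] = 1                 # mark each distinct tf value
    R = [1 if p == 0 or tfs[p] != tfs[p - 1] else 0 for p in range(len(tfs))]
    return T, R


def tf_at(T, R, i):
    """tf of the i-th entry of the list: select_1(T, nu_t - rank_1(R, i) + 1)."""
    nu_t = sum(T)                        # number of distinct tf values
    return select1(T, nu_t - rank1(R, i) + 1)


tfs = [7, 7, 4, 2, 2, 2, 1]
T, R = encode_tf(tfs)
third = tf_at(T, R, 3)                   # tf of the third document in the list
```

Since the list is in decreasing weight order, the g-th run of R_t holds the g-th largest distinct value, which is the (ν_t - g + 1)-th marked position of T_t.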

Let us analyze the space required by our representation. According to Lemma 1, the wavelet tree 
of L occupies n log m + O(n) bits. The classical encoding of inverted files, when documents are 
sorted by increasing document identifier, records the consecutive differences using δ-codes [67]. This 
needs at most Σ_t df_t log(m/df_t) ≤ n log(mν/n) bits plus lower-order terms, which is asymptotically less 
than our space. If, however, the lists are sorted by decreasing tf values, then differential encoding 
can only be used on some parts of the lists. Yet, n log m (plus lower-order terms) is still an upper 
bound to the space required to list the documents. As can be seen, no inverted index representation 
takes more space than our wavelet tree. However, it must be remembered that our wavelet tree will 
offer the combined functionality of both inverted indexes, and more. 

We also store the tf and the s_t values. The former are encoded with T_t and R_t. We use Okanohara 
and Sadakane's representation [54] for T_t and Pătraşcu's [55] for R_t (see Section 2), to achieve total 
space ν_t log(m_t/ν_t) + O(ν_t) + ν_t log(df_t/ν_t) + o(df_t) bits and retain constant time access to tf values. This 
space is similar to that needed to represent, in a traditional tf-sorted index, each new tf_{t,d} value and 
the number of entries that share it. The s_t values require ν log(n/ν) + o(n) bits using again Pătraşcu 
[55], which gives constant-time access to s_t and requires less space than the usual pointers from 
the vocabulary to the list of each term. Overall our data structure takes at most nlog(mi^) + 0(n) 
bits. 

We will now consider the classical and extended operations that can be carried out with our 
data structure. In particular we will show how to give some support for hierarchical document 
retrieval (as already seen for general documents) and for stemmed searches without using any extra 
space. One common way to support stemming is by coalescing terms having the same root at index 
construction time. However, the index is then unable to provide non-stemmed searching. One can 
of course index the stemmed and non-stemmed occurrence of each term, but this costs space. Our 
method can provide both types of search without using any extra space provided all the variants 
of the same stemmed word are contiguous in the vocabulary (this is in many cases automatic as 
stemmed terms share the same root, or prefix). 

6.1 Full-Text Retrieval 

The full-text index, rather than $L_t$, requires a list $F_t$, where the same documents are sorted by 
increasing document identifier. Different kinds of access operations need to be carried out on $F_t$. 
We now show how all these can be supported in $O(\log m)$ time or less. 

Direct retrieval. First, with our wavelet tree representation of $L$ we can compute any specific 
value $F_t[k]$ in time $O(\log m)$. This is equivalent to finding the $k$th smallest value in $L[s_t, s_{t+1}-1]$, 
that is, query range_quantile($L, s_t, s_{t+1}-1, k$) described in Section 4.1. 
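
As an illustration of this reduction, the following Python sketch builds a small pointer-based wavelet tree and answers $F_t[k]$ with a range quantile query. It is a minimal sketch under stated assumptions, not the succinct structure discussed in this article: the class `WaveletTree`, the plain prefix-count arrays used in place of rank-capable bitmaps, the 0-based half-open ranges, and the toy lists are all hypothetical.

```python
class WaveletTree:
    """Pointer-based wavelet tree over integer symbols in [lo, hi] (illustrative only)."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return  # leaf, or empty subtree that queries never enter
        mid = (lo + hi) // 2
        # zeros[p] = how many of the first p symbols go to the left child (rank_0)
        self.zeros = [0]
        for x in seq:
            self.zeros.append(self.zeros[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)


def range_quantile(t, i, j, k):
    """Return the k-th smallest (k >= 1) symbol of seq[i:j]."""
    while t.lo != t.hi:
        il, jl = t.zeros[i], t.zeros[j]
        nl = jl - il                      # symbols of the range that go left
        if k <= nl:
            t, i, j = t.left, il, jl
        else:
            t, i, j, k = t.right, i - il, j - jl, k - nl
    return t.lo


# Hypothetical toy index: three lists L_t (by decreasing tf) concatenated in L.
L = [5, 2, 7, 3, 5, 1, 8, 2, 6]
starts = [0, 3, 7, 9]                     # L_t = L[starts[t-1]:starts[t]]
wt = WaveletTree(L, 1, 8)
# F_2 is L_2 = [3, 5, 1, 8] in document order, that is [1, 3, 5, 8]
print(range_quantile(wt, starts[1], starts[2], 3))   # 5
```

Each call follows a single root-to-leaf path, so no sorted copy of the list is ever materialized.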

We can also extract any segment $F_t[k,k']$, in order, in time $O\left((k'-k+1)\log\frac{m}{k'-k+1}\right)$, that 
is, faster per document as we extract more documents. The algorithm is as for range_quantile on 
quantiles $k$ to $k'$ simultaneously, going just by one branch when both $k$ and $k'$ choose the same 
branch, and splitting the interval into two separate searches when they do not. We arrive at at most 
$k'-k+1$ leaves of the wavelet tree, thus the cost follows from Lemma 3. 
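
The simultaneous descent can be sketched as follows. This is an illustrative Python version under the same assumptions as a plain pointer-based tree with prefix counts (the names `WaveletTree` and `range_quantiles`, the 0-based half-open ranges, and the toy data are hypothetical); it reports each distinct value together with its frequency in the queried range.

```python
class WaveletTree:
    """Pointer-based wavelet tree over integer symbols in [lo, hi] (illustrative only)."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        self.zeros = [0]                  # zeros[p] = rank_0 over the first p symbols
        for x in seq:
            self.zeros.append(self.zeros[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)


def range_quantiles(t, i, j, k, k2, out):
    """Append (value, frequency in seq[i:j]) for quantiles k..k2 of seq[i:j]."""
    if t.lo == t.hi:
        out.append((t.lo, j - i))
        return
    il, jl = t.zeros[i], t.zeros[j]
    nl = jl - il
    if k <= nl:                           # some requested quantiles lie to the left
        range_quantiles(t.left, il, jl, k, min(nl, k2), out)
    if k2 > nl:                           # some lie to the right
        range_quantiles(t.right, i - il, j - jl, max(k - nl, 1), k2 - nl, out)


L = [5, 2, 7, 3, 5, 1, 8, 2, 6]           # hypothetical lists L_1, L_2, L_3 concatenated
wt = WaveletTree(L, 1, 8)
segment = []
range_quantiles(wt, 3, 7, 2, 3, segment)  # F_2[2,3] for L_2 = L[3:7]
print(segment)                            # [(3, 1), (5, 1)]
```

The two recursive calls share the upper part of the path, which is where the saving over independent quantile queries comes from.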

Another useful operation is fingered search, that is, to find $F_t[k']$ after having visited $F_t[k]$, for 
some $k' > k$. This is slightly more complex than for consecutive range next value queries. We need 
to store $\log m$ values $m_\delta$, $e_\delta$ and $v_\delta$, where $m_0 = \infty$ and $e_1 = 0$, and the others are computed as 
follows when we obtain $F_t[k]$: at wavelet tree node $v$ of depth $\delta$ (the root being depth 1) we set 
$v_\delta \leftarrow v$ and, if we must go to the left child, then we set $m_\delta \leftarrow e_\delta + n_l$ and $e_{\delta+1} \leftarrow e_\delta$; else we set 
$m_\delta \leftarrow m_{\delta-1}$ and $e_{\delta+1} \leftarrow e_\delta + n_l$. Here $n_l$ is the value local to the node (recall rqq in Algorithm 3). 
Therefore $e_\delta$ counts the values skipped to the left, and $m_\delta$ is the maximum $k'$ value such that the 
downward paths to compute $F_t[k]$ and $F_t[k']$ coincide up to depth $\delta$. Now, to compute $F_t[k']$, we 
consider all the $\delta$ values, from largest to smallest, until finding the first one such that $k' \le m_\delta$. 
From there on we recompute the downward path, resetting $m_\delta$, $e_\delta$, and $v_\delta$ accordingly. 

If we carry out this operation $r$ times, across a range $[k,k']$, the cost is $O\left(\log m + r\log\frac{k'-k+1}{r}\right)$ 
by Lemma 5. Algorithm 5 depicts the new extended variants of rqq. 
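
A fingered iterator along these lines can be sketched in Python as below. This is only an illustration: the stack of tuples plays the role of the per-depth values $v_\delta$, $e_\delta$ and $m_\delta$, and, as an assumption of this sketch, it also keeps the local range of each visited node so that the descent can be resumed there. The names `WaveletTree` and `FingeredQuantile`, the 0-based half-open ranges, and the toy data are hypothetical.

```python
class WaveletTree:
    """Pointer-based wavelet tree over integer symbols in [lo, hi] (illustrative only)."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        self.zeros = [0]                  # zeros[p] = rank_0 over the first p symbols
        for x in seq:
            self.zeros.append(self.zeros[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)


class FingeredQuantile:
    """Answers quantile queries on seq[i:j] for increasing k, reusing the last path."""

    def __init__(self, root, i, j):
        # entries: (node, local i, local j, e = values skipped to the left,
        #           m = largest k whose path still passes through this node)
        self.path = [(root, i, j, 0, float("inf"))]

    def get(self, k):
        while k > self.path[-1][4]:       # climb until the node is shared with k
            self.path.pop()
        t, i, j, e, m = self.path[-1]
        while t.lo != t.hi:               # recompute the rest of the path
            il, jl = t.zeros[i], t.zeros[j]
            nl = jl - il
            if k - e <= nl:
                t, i, j, m = t.left, il, jl, e + nl
            else:
                t, i, j, e = t.right, i - il, j - jl, e + nl
            self.path.append((t, i, j, e, m))
        return t.lo


L = [5, 2, 7, 3, 5, 1, 8, 2, 6]           # hypothetical concatenated lists
wt = WaveletTree(L, 1, 8)
it = FingeredQuantile(wt, 0, 7)           # values of L[0:7] in increasing order
print([it.get(k) for k in (2, 3, 5, 7)])  # [2, 3, 5, 8]
```

Queries must be issued with non-decreasing `k` not exceeding the range length; the sketch does not check this.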

Intersection algorithms. The most important operation in the various list intersection algorithms 
described in the literature is to find the first $k$ such that $F_t[k] \ge d$, given $d$. This is usually solved 
with a combination of sampling and linear, exponential, or binary search. In our case, this operation 
takes time $O(\log m)$ with query range_next_value($L, s_t, s_{t+1}-1, d$) described in Section 4.2. Our time 
complexity is not far from the $O(\log(s_{t+1}-s_t))$ of traditional approaches. Moreover, as explained in 
Section 4.3, we can use fingered searches on rnv to achieve time $O\left(\log m + r\log\frac{m}{r}\right)$ for $r$ accesses. 
Furthermore, if all the accesses are for documents in a range $[d,d']$ then, by Lemma 5, the cost will 
be $O\left(\log m + r\log\frac{d'-d+1}{r}\right)$ time. This is indeed the time required by $r$ successive searches using 
exponential search. 
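
To make the role of this primitive concrete, the Python sketch below implements a next-value query on a plain pointer-based wavelet tree and uses it in a simple alternating ("leapfrog") intersection of two lists. It is an illustration under stated assumptions: the names `WaveletTree`, `range_next_value` and `leapfrog_intersect`, the 0-based half-open ranges, and the toy data are hypothetical, and no fingering is used.

```python
class WaveletTree:
    """Pointer-based wavelet tree over integer symbols in [lo, hi] (illustrative only)."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        self.zeros = [0]                  # zeros[p] = rank_0 over the first p symbols
        for x in seq:
            self.zeros.append(self.zeros[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)


def range_next_value(t, i, j, d):
    """Smallest symbol >= d in seq[i:j], or None if there is none."""
    if i >= j or t.hi < d:
        return None
    if t.lo == t.hi:
        return t.lo
    il, jl = t.zeros[i], t.zeros[j]
    if d <= (t.lo + t.hi) // 2:           # the answer may be in the left child
        r = range_next_value(t.left, il, jl, d)
        if r is not None:
            return r
    return range_next_value(t.right, i - il, j - jl, d)


def leapfrog_intersect(t, r1, r2):
    """Documents common to seq[r1[0]:r1[1]] and seq[r2[0]:r2[1]], in increasing order."""
    out, d = [], t.lo
    while True:
        a = range_next_value(t, r1[0], r1[1], d)
        b = None if a is None else range_next_value(t, r2[0], r2[1], a)
        if b is None:
            return out
        if a == b:
            out.append(a)
            d = a + 1
        else:
            d = b                         # skip ahead to the other list's candidate


L = [5, 2, 7, 3, 5, 1, 8, 2, 6]           # hypothetical lists L_1 = L[0:3], L_2 = L[3:7]
wt = WaveletTree(L, 1, 8)
print(range_next_value(wt, 3, 7, 4))      # 5
print(leapfrog_intersect(wt, (0, 3), (3, 7)))   # [5]
```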

Finally, we can intersect the lists $F_t$ and $F_{t'}$ using range_intersect($L, s_t, s_{t+1}-1, s_{t'}, s_{t'+1}-1$), in 
adaptive time $O\left(\alpha\log\frac{m}{\alpha}\right)$ (recall Section 4.3). As explained, this can be extended to intersecting 
$k$ terms simultaneously, and to report documents where a minimum number of the terms appear. 
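
The simultaneous descent behind such an intersection can be sketched in Python as follows. This is an illustrative version on a plain pointer-based tree, assuming 0-based half-open ranges; the names `WaveletTree` and `range_intersect` and the toy data are hypothetical, and the sketch makes no attempt to reproduce the space bounds of the succinct structure.

```python
class WaveletTree:
    """Pointer-based wavelet tree over integer symbols in [lo, hi] (illustrative only)."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        self.zeros = [0]                  # zeros[p] = rank_0 over the first p symbols
        for x in seq:
            self.zeros.append(self.zeros[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)


def range_intersect(t, i1, j1, i2, j2, out):
    """Append to out the symbols occurring in both seq[i1:j1] and seq[i2:j2]."""
    if i1 >= j1 or i2 >= j2:
        return                            # one range became empty: prune this subtree
    if t.lo == t.hi:
        out.append(t.lo)
        return
    a1, b1 = t.zeros[i1], t.zeros[j1]
    a2, b2 = t.zeros[i2], t.zeros[j2]
    range_intersect(t.left, a1, b1, a2, b2, out)
    range_intersect(t.right, i1 - a1, j1 - b1, i2 - a2, j2 - b2, out)


L = [5, 2, 7, 3, 5, 1, 8, 2, 6]           # hypothetical lists L_1, L_2, L_3 concatenated
wt = WaveletTree(L, 1, 8)
common = []
range_intersect(wt, 0, 3, 3, 7, common)   # documents shared by terms 1 and 2
print(common)                             # [5]
```

Subtrees are abandoned as soon as either projected range is empty, so the work concentrates on the regions where the two lists actually interleave.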

Other operations of interest. If the range of terms $[t,t']$ represents the derivatives of a single 
stemmed root, we might wish to act as if we had a single list $F_{t,t'}$ containing all the documents 
Algorithm 5 Extended variants of range quantile algorithms: mrqq(v_root, i, j, k, k') outputs all 
the (distinct) values range_quantile(S, i, j, k) to range_quantile(S, i, j, k'), with their frequencies, on 
the wavelet tree of sequence S, assuming k' ≤ j − i + 1; frqq1(v_root, i, j, k) returns the same as 
rqq(v_root, i, j, k) but prepares the iterator for subsequent fingered searches; those are carried out 
by calling frqq(v_root, k), where it is assumed that the k values increase at each call; frqq' is the 
recursive procedure that reprocesses the needed part of the path. 

mrqq(v, i, j, k, k') 
    if v is a leaf then 
        output (label(v), j − i + 1) 
    else 
        i_l ← rank_0(B_v, i − 1) + 1 
        j_l ← rank_0(B_v, j) 
        i_r ← i − i_l + 1, j_r ← j − j_l 
        n_l ← j_l − i_l + 1 
        if k ≤ n_l then 
            mrqq(v_l, i_l, j_l, k, min(n_l, k')) 
        end if 
        if k' > n_l then 
            mrqq(v_r, i_r, j_r, max(k − n_l, 1), k' − n_l) 
        end if 
    end if 

frqq1(v, i, j, k) 
    m_0 ← ∞ 
    e_1 ← 0 
    return frqq'(v, i, j, k, 1) 

frqq(v, k) 
    δ ← height of v 
    while k > m_{δ−1} do 
        δ ← δ − 1 
    end while 
    return frqq'(v_δ, i_δ, j_δ, k − e_δ, δ) 

frqq'(v, i, j, k, δ) 
    v_δ ← v, i_δ ← i, j_δ ← j 
    if v is a leaf then 
        return (label(v), j − i + 1) 
    else 
        i_l ← rank_0(B_v, i − 1) + 1 
        j_l ← rank_0(B_v, j) 
        i_r ← i − i_l + 1, j_r ← j − j_l 
        n_l ← j_l − i_l + 1 
        if k ≤ n_l then 
            m_δ ← e_δ + n_l 
            e_{δ+1} ← e_δ 
            return frqq'(v_l, i_l, j_l, k, δ + 1) 
        else 
            m_δ ← m_{δ−1} 
            e_{δ+1} ← e_δ + n_l 
            return frqq'(v_r, i_r, j_r, k − n_l, δ + 1) 
        end if 
    end if 



where they occur. Indeed, if we apply our previous algorithm to obtain $F_t[k]$ from $L[s_t, s_{t+1}-1]$ 
on the range $L[s_t, s_{t'+1}-1]$, we obtain precisely $F_{t,t'}[k]$, if we understand that a document $d$ may 
repeat several times in the list if different terms in $[t,t']$ appear in $d$. Still we can obtain the list 
of docc distinct documents for a range of terms $[t,t']$ with exactly the same method as for the D 
array, described at the beginning of Section 5, in time $O\left(\mathit{docc}\log\frac{m}{\mathit{docc}}\right)$. 

Furthermore, the algorithms to find the first $k$ such that $F_t[k] \ge d$ can be applied verbatim to 
obtain the same result for $F_{t,t'}[k] \ge d$. All the variants of these queries are directly supported as 
well. Our intersection algorithm can also be applied verbatim in order to intersect stemmed terms. 

Additionally, note that we can compute some summarization information. More precisely, we 
can obtain the local vocabulary of a document $d$, that is, the set of different terms that appear in 
$d$. By executing $\mathrm{rank}_1(S, \mathrm{select}_d(L, i))$ for successive $i$ values, we obtain all the local vocabulary, in 
order, and in time $O(\log m)$ per term. This allows, for example, merging the vocabularies of different 
documents, or binary searching for a particular term in a particular document (yet, the latter is 
easier via two rank operations on $L$: $\mathrm{rank}_d(L, s_{t+1}-1) - \mathrm{rank}_d(L, s_t-1)$; then the corresponding 
position can be obtained by $\mathrm{select}_d(L, 1 + \mathrm{rank}_d(L, s_t-1))$). 
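
The two computations can be illustrated with the short Python sketch below. It is deliberately naive: linear scans over a plain list stand in for the rank and select operations of the wavelet tree, and a binary search over the sorted list starts stands in for the rank on the bitmap of the $s_t$ values. The function names, the 0-based positions, and the toy data are hypothetical.

```python
from bisect import bisect_right

L = [5, 2, 7, 3, 5, 1, 8, 2, 6]   # hypothetical concatenated lists
starts = [0, 3, 7]                # 0-based start of each L_t (terms 1, 2, 3)


def local_vocabulary(d):
    """Terms whose list contains document d, in increasing term order."""
    # each position of d in L plays the role of select_d(L, i); mapping a
    # position to its term plays the role of rank_1 on the bitmap of list starts
    return [bisect_right(starts, p) for p, x in enumerate(L) if x == d]


def occurs(t, d):
    """True if document d appears in the list of term t (1-based term id)."""
    end = starts[t] if t < len(starts) else len(L)
    # difference of two rank_d values taken at the list boundaries
    return L[:end].count(d) - L[:starts[t - 1]].count(d) > 0


print(local_vocabulary(5))        # [1, 2]
print(occurs(3, 5))               # False
```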

Finally, the data structure provides some basic support for temporal and hierarchical documents, 
by restricting the inverted lists $F_t$ to a range of document values $[d_{\min}, d_{\max}]$ (recall Section 5.1). A 
simple way to proceed is to first carry out a query range_next_value($L, s_t, s_{t+1}-1, d_{\min}$) with rnv 
(Algorithm 3), which will also give us the rank $p$ of the first document $\ge d_{\min}$. Then any subsequent 
range quantile query on $F_t$ must increase its argument by $p-1$, and discard answers larger than 
$d_{\max}$. On the other hand, functions hdlist and hdint (Algorithm 4) will work without changes, 
and support inverted list algorithms on XML retrievable units, just as in Section 5.2. 
6.2 Ranked Retrieval 

We focus now on the operations of interest for ranked retrieval, which are also simulated in $O(\log m)$ 
time or less. 

Direct access and Persin's algorithm. The $L_t$ lists used for ranked retrieval are directly 
concatenated in $L$, so $L_t[i]$ is obtained by accessing symbol $L[s_t+i-1]$ using the wavelet tree. Recall 
that the term frequencies tf are available in constant time. A range $L_t[i,i']$ is obtained in time 
$O\left((i'-i+1)\log\frac{m}{i'-i+1}\right)$ using query range_report($L, s_t+i-1, s_t+i'-1, [1,m]$) (Algorithm 2). 

This algorithm has the problem of retrieving the documents in document order, not in tf order as 
they are in $L_t$. Note, however, that retrieving the highest-tf documents in document order is indeed 
beneficial for Persin's algorithm [57] (recall Section 3.2), where a problem is how to accumulate 
results across unordered document sets. More precisely, assume we have the current candidate set 
as an array ordered by increasing document identifier. Persin's algorithm computes a threshold 
term frequency $f$, so that the next list to consider, $L_t$, should be processed only for tf values that 
are at least $f$. Instead of traversing $L_t$ by decreasing tf values and stopping when these fall below 
$f$, we can compute $p = \mathrm{select}_1(R_t, v_t - \mathrm{rank}_1(T_t, f) + 1) - 1$, so that $L_t[1,p]$ is precisely the prefix 
where the term frequencies are at least $f$. Now we extract all the values as explained. As they are 
obtained in increasing document identifier order, they are easily merged with the current candidate 
set, in order to accumulate frequencies in common documents. 
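
The merging step can be illustrated with the Python sketch below. It is a simplified stand-in, not Persin's full algorithm: a binary search over an explicit decreasing tf array replaces the computation of $p$ from $R_t$ and $T_t$, `sorted()` replaces the range_report query that delivers the prefix in document order, and every document of the prefix is accumulated. The function names and the toy data are hypothetical.

```python
from bisect import bisect_right


def prefix_length(tf_desc, f):
    """Number p of leading entries of a decreasing tf list with tf >= f."""
    return bisect_right([-x for x in tf_desc], -f)


def accumulate(candidates, docs_tf):
    """Merge two lists of (doc, score) pairs sorted by doc, adding scores of common docs."""
    out, a, b = [], 0, 0
    while a < len(candidates) or b < len(docs_tf):
        if b == len(docs_tf) or (a < len(candidates) and candidates[a][0] < docs_tf[b][0]):
            out.append(candidates[a])
            a += 1
        elif a == len(candidates) or docs_tf[b][0] < candidates[a][0]:
            out.append(docs_tf[b])
            b += 1
        else:                             # same document in both lists
            out.append((candidates[a][0], candidates[a][1] + docs_tf[b][1]))
            a += 1
            b += 1
    return out


L_t = [3, 5, 1, 8]                        # documents of a hypothetical term, by decreasing tf
tf_t = [9, 4, 4, 1]
p = prefix_length(tf_t, 4)                # 3 entries have tf >= 4
prefix = sorted(zip(L_t[:p], tf_t[:p]))   # prefix L_t[1,p] in document order
print(accumulate([(3, 2), (7, 1)], prefix))   # [(1, 4), (3, 11), (5, 4), (7, 1)]
```

Because both inputs arrive in document order, the merge is a single linear pass with no hashing or re-sorting of the candidate set.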

Other operations of interest. Any candidate document $d$ in Persin's algorithm can be directly 
evaluated, obtaining its $tf_{t,d}$ values, by finding $d$ within $L_t$ for each $t \in q$ (with $\mathrm{rank}_d$ and $\mathrm{select}_d$ 
on $L$, as explained), and its tf obtained from $R_t$ and $T_t$, all in $O(|q| \log m)$ time. 

If we use stemming, we might want to retrieve prefixes of several lists $L_t$ to $L_{t'}$. We may carry 
out the previous algorithm to deliver all the distinct documents in these prefixes, now carrying on 
the $t'-t+1$ intervals as we descend in the wavelet tree. When we arrive at the relevant leaves 
labeled $d$, the corresponding positions will be contiguous, thus we can naturally return just one 
occurrence of each $d$ in the union. If we wish to obtain the sum of the tf values for all the stemmed 
terms in $d$, we can traverse the wavelet tree upwards for each interval element at leaf $d$, and obtain 
its tf upon finding its position in $L$. Alternatively, we could store the tf values aligned to the leaves 
and mark their cumulative values on a compressed bitmap, so as to obtain the sum in constant time 
as the difference of two $\mathrm{select}_1$ operations on that bitmap. The space for tf, however, becomes now 
n log ^ + O(n) bits, which is higher than in our current representation. This method also delivers 
the results in document order. 

Maintaining the tf values aligned to the leaf order yields some support for hierarchical queries. 
Assume a retrievable unit (recall Section 5.2) spans the document range $[d_l, d_r]$, and thus we wish 
to compute the total tf of $t$ in range $[d_l, d_r]$. Any such range is exactly covered by $O(\log m)$ wavelet 
tree nodes (Lemma 2). We can descend, projecting the range of $L_t$ in $L$, until those nodes, and 
then add up the accumulated tf values of those $O(\log m)$ nodes, in overall time $O(\log m)$. 

We can also support temporal and hierarchical documents by restricting our accesses in $L_t$ only 
to documents within a range $[d_{\min}, d_{\max}]$ (recall Section 5.1). It is sufficient to use $[d_{\min}, d_{\max}]$ as 
the last argument when we use the range_report query that underlies our support for accessing $L_t$. 
This automatically yields, for example, Persin's algorithm restricted to a range of documents. 






7 Conclusions 



The wavelet tree data structure [34] has had an enormous impact on the implementation of 
reduced-space text databases. In this article we have shown that it has several other under-explored 
capabilities. We have proposed three new algorithms on wavelet trees that solve fundamental problems, 
improving upon the state of the art in some aspects. For range intersections we achieve an adaptive 
complexity that matches the one achieved for sorted ranges. For range quantile and range next value 
problems, we match or get close to the best known time complexities while using less space: 
basically that needed to represent the sequence $S[1,n]$ plus $O(n)$ extra bits, versus the $O(n \log n)$ extra 
bits required by previous solutions. Furthermore, if we use compressed bitmap representations [60] 
in our wavelet trees, we retain the time complexities and achieve zero-order compression in the 
representation of $S$ [34], that is, our overall space including the sequence becomes $nH_0(S) + O(n + \sigma)$, 
where $[1,\sigma]$ is the alphabet of $S$ and $H_0(S)$ is its empirical zero-order entropy. 

We have also explored a number of applications of those novel algorithms to two areas of 
Information Retrieval (IR): document retrieval on general string databases, and inverted indexes. 
In both cases we obtained support for a number of powerful operations without further increasing 
the space required to support basic ones. 

The algorithms are elegant and simple to implement, so they have the potential to be useful 
in practice. Future work involves implementing them within an IR framework and evaluating their 
practical performance. Although we have used some theoretical data structures for handling bitmaps 
within convenient space bounds, practical variants of rank/select-capable plain and compressed 
bitmaps, as well as various wavelet tree implementations, are publicly available$^{11}$. Some preliminary 
experiments [25] show that an early version of our results [32] does improve significantly in practice 
upon the previous state of the art on document retrieval for general strings. Our improved versions 
presented in this article should widen the gap. In the case of inverted indexes we do not expect our 
representation to be faster for the basic operations, yet it is likely that it requires less space than 
that of a full-text plus a ranked-retrieval inverted index, and that it is more efficient on sophisticated 
operations. 

Acknowledgements. We thank Jeremy Barbay for his help in understanding the adaptive complexity 
measures for intersections, and Meg Gagie for righting our grammar. 